code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including first-time reviews of the four newer components (AuditLog, NotificationOutbox, SiteCallAudit, Transport) — so the code-reviews/ index reflects today's codebase rather than the 2026-05-16 baseline. 172 new Open findings (0 Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules. regen-readme.py now derives each module's Last reviewed + Commit from its findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future single-module re-reviews show their own date in the Module Status table.
2026-05-28 02:55:47 -04:00
parent 1eb6e972b0
commit f93b7b99bb
25 changed files with 8793 additions and 115 deletions
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.SiteEventLogging` |
 | Design doc | `docs/requirements/Component-SiteEventLogging.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 9 |

 ## Summary

@@ -46,6 +46,31 @@ keyword-search filter (SiteEventLogging-013) and a claimed initial-purge block o
 host startup thread (SiteEventLogging-014 — later re-triaged to Won't Fix, the
 premise does not hold on .NET 8+).

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+Re-reviewed the module at commit `1eb6e97`. All fourteen prior findings remain closed
+and their resolutions hold up under inspection: the lock-guarded `WithConnection`
+overloads, the background-writer `Channel<T>` with disposed-mid-drain fault
+propagation, the `auto_vacuum = INCREMENTAL` schema + logical-size measurement, the
+severity index, the `LIKE` keyword-search escaping, and the concrete-recorder DI
+wiring are all present and correct at this commit. Nine new findings were recorded —
+none are regressions of prior fixes. The most notable (SiteEventLogging-016, **High**)
+is a correctness defect in the query path: timestamps are stored as ISO 8601 strings
+generated from `DateTimeOffset.UtcNow` (so they always have a `+00:00` offset suffix),
+but the `From`/`To` filters are stringified verbatim via `request.From.Value.ToString("o")`
+without normalising to UTC, so a central client that sends a non-UTC `DateTimeOffset`
+gets a broken lexicographic comparison and either spuriously includes or excludes
+events. The next-most-notable findings are SiteEventLogging-015 (unbounded background
+write queue can grow without limit under sustained writer slowness — sister
+`SqliteAuditWriter` uses a bounded channel) and SiteEventLogging-017 (the central
+client's `PageSize` is used verbatim with no upper-bound clamp, defeating the design's
+"prevents broad queries from overwhelming the communication channel" rationale). The
+remaining findings are low-severity hygiene / documentation: an unused
+`FailedWriteCount` metric, untyped severity/event-type fields, non-invariant culture
+parsing, the purge service running on the standby node, the redundant `Cache=Shared`
+on a single-connection logger, and a non-volatile stop flag in a concurrency stress
+test.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -61,6 +86,21 @@ premise does not hold on .NET 8+).
 | 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
 | 10 | Documentation & comments | ☑ | `LogEventAsync` XML doc says "asynchronously" but is synchronous (-009); stale "Phase 4+" placeholder (-011). |

+_Re-review (2026-05-28, `1eb6e97`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps (-016); `DateTimeOffset.Parse` without invariant culture is culture-sensitive (-021); severity/event-type accept any non-empty string with no schema enforcement (-020). |
+| 2 | Akka.NET conventions | ☑ | `EventLogHandlerActor` is a simple `Receive`/`Tell` bridge with no supervision concerns of its own; no new findings. |
+| 3 | Concurrency & thread safety | ☑ | Concurrent-write stress test uses a non-volatile `stop` flag (-023). The shared-connection lock pattern is correct post-SiteEventLogging-003. |
+| 4 | Error handling & resilience | ☑ | `FailedWriteCount` is exposed but nothing in Health Monitoring polls it — the metric is unobserved (-018). |
+| 5 | Security | ☑ | Queries are fully parameterised. `PageSize` and `KeywordFilter` from the central client are not bounded (-017) — a hostile or buggy central could request `int.MaxValue` rows or multi-MB `LIKE` patterns. |
+| 6 | Performance & resource management | ☑ | Background write queue is unbounded (-015); `Cache=Shared` is redundant for a single-connection logger (-022); upper-bound on `PageSize` missing (-017). |
+| 7 | Design-document adherence | ☑ | `EventLogPurgeService` is registered as a per-host `BackgroundService` and runs on the standby too, but the design says "the daily background job runs on the active node" (-019). |
+| 8 | Code organization & conventions | ☑ | `FailedWriteCount` is on the concrete `SiteEventLogger`, not on `ISiteEventLogger`, so any future non-concrete consumer cannot read it (-018). |
+| 9 | Testing coverage | ☑ | Non-volatile `stop` flag in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` (-023). No tests for `PageSize` bounds, `From`/`To` timezone handling, or unobserved `FailedWriteCount`. |
+| 10 | Documentation & comments | ☑ | `FailedWriteCount` XML doc claims "Health Monitoring can poll" but nothing does (-018). Severity / event-type docs enumerate values that are not enforced (-020). |
+
 ## Findings

 ### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space
@@ -706,3 +746,341 @@ re-triage note). No code change made. A verification test
 `StartAsync_DoesNotBlock_OnTheInitialPurge` was added to pin this behaviour
 (asserts `StartAsync` returns in under 1 s and the initial purge still runs on the
 background scheduler).
+
+### SiteEventLogging-015 — Background write queue is unbounded; can grow without limit under sustained writer slowness
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:58-63` |
+
+**Description**
+
+`SiteEventLogger` creates its background-writer feeder as
+`Channel.CreateUnbounded<PendingEvent>(...)`. The writer thread funnels every write
+through the shared `_writeLock` (acquired by `WithConnection`), so any condition that
+makes a single iteration slow — a long-running query in `EventLogQueryService`
+holding the lock, a `PurgeByStorageCap` run that takes the lock for batched
+`DELETE` + `PRAGMA incremental_vacuum`, a disk stall, or a sustained event burst
+from an alarm storm / script failure loop — drives the queue arbitrarily large.
+Every queued `PendingEvent` retains its `TaskCompletionSource` and its payload
+strings, so there is no upper bound on how much memory the recorder can hold.
+
+The sister centralized-audit component `ScadaLink.AuditLog/Site/SqliteAuditWriter.cs`
+addresses the same hot-path-writer problem with
+`Channel.CreateBounded<...>(new BoundedChannelOptions(_options.ChannelCapacity) { ..., FullMode = BoundedChannelFullMode.Wait })`,
+giving back-pressure to producers. Site event logging picked the riskier choice for
+a component that — per the design — is fed by every site subsystem (script, alarm,
+deployment, DCL, store-and-forward, instance lifecycle, notification) and has both
+a 30-day retention sweep and a 1 GB cap-purge competing for the same lock.
+
+**Recommendation**
+
+Switch to `Channel.CreateBounded<PendingEvent>(...)` with a configurable capacity
+(default on the order of 10 000: large enough to absorb a normal alarm burst,
+small enough to bound memory). Pick a `FullMode` that matches policy: `Wait` for
+back-pressure (callers `await`, so their actor processing stalls on the queue,
+which gives back some of the SiteEventLogging-005 win but is safe), or
+`DropOldest` / `DropWrite` with a counter (drop-and-count is closer to
+"best-effort audit"). Add the dropped-event counter to `FailedWriteCount` or a
+sibling metric. Document the chosen policy on `ISiteEventLogger.LogEventAsync`.
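+
+A minimal sketch of the drop-and-count variant, assuming a simplified
+`PendingEvent` and a hypothetical `BoundedEventQueue` wrapper (neither is taken
+from the reviewed code). It relies on `TryWrite` returning `false` when a
+`Wait`-mode bounded channel is full, so the producer never blocks:
+
+```csharp
+using System.Threading;
+using System.Threading.Channels;
+
+// Hypothetical stand-in; the real PendingEvent also carries a TaskCompletionSource.
+public sealed record PendingEvent(string EventType, string Message);
+
+public sealed class BoundedEventQueue
+{
+    private readonly Channel<PendingEvent> _channel;
+    private long _droppedCount;
+
+    public BoundedEventQueue(int capacity = 10_000)
+    {
+        _channel = Channel.CreateBounded<PendingEvent>(new BoundedChannelOptions(capacity)
+        {
+            SingleReader = true,                    // one background writer drains the queue
+            FullMode = BoundedChannelFullMode.Wait  // TryWrite returns false when full
+        });
+    }
+
+    public ChannelReader<PendingEvent> Reader => _channel.Reader;
+
+    public long DroppedCount => Interlocked.Read(ref _droppedCount);
+
+    // Non-blocking enqueue: the newest event is dropped and counted when the queue is full.
+    public bool TryEnqueue(PendingEvent evt)
+    {
+        if (_channel.Writer.TryWrite(evt)) return true;
+        Interlocked.Increment(ref _droppedCount);
+        return false;
+    }
+}
+```
+
+Producers that should see back-pressure instead can `await` `Writer.WriteAsync`
+on the same channel; which of the two fits is the policy decision named above.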
+
+### SiteEventLogging-016 — `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:67-77`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:159`, `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:72-78` |
+
+**Description**
+
+Event rows are persisted with `timestamp` = `DateTimeOffset.UtcNow.ToString("o")`,
+which always emits the round-trip ISO 8601 form ending in the literal offset
+`+00:00` (e.g. `2026-05-28T12:34:56.7890123+00:00`). The query path filters by
+range using a direct string comparison:
+
+```
+whereClauses.Add("timestamp >= $from");
+parameters.Add(new SqliteParameter("$from", request.From.Value.ToString("o")));
+```
+
+`request.From` is a `DateTimeOffset?` and `ToString("o")` preserves whatever offset
+the caller passed in. If a central client passes a non-UTC `DateTimeOffset` — for
+example the result of `DateTimeOffset.Now` in a `UTC+05:00` timezone — the produced
+string is `"2026-05-28T17:34:56.0000000+05:00"`, which is lexicographically *greater*
+than the equivalent UTC instant string `"2026-05-28T12:34:56.0000000+00:00"`. The
+comparison `timestamp >= $from` is then evaluated as a byte-by-byte string compare
+(SQLite default `BINARY` collation), so the query either spuriously excludes events
+that genuinely occurred in the range, or spuriously includes events from a wholly
+different hour. The same defect applies to `To`. The retention purge does
+`DateTimeOffset.UtcNow.AddDays(-N).ToString("o")` (UTC) so it is safe; only the
+central query path is vulnerable.
+
+The design explicitly states "All timestamps are UTC throughout the system" but the
+boundary between a central `DateTimeOffset` and the SQLite store is not enforced.
+A central UI rendered in a non-UTC timezone is the most likely trigger, and the
+defect silently corrupts every query that filters by time range — exactly the
+filter most likely to be set on a "show me what happened around the failover" query.
+
+**Recommendation**
+
+Normalise `From` / `To` to UTC before serialising:
+`request.From.Value.ToUniversalTime().ToString("o")`, so the produced offset is
+always `+00:00`. Avoid `.UtcDateTime.ToString("o")` here: it formats a `DateTime`
+with `Kind = Utc`, which ends in `Z` rather than `+00:00` and therefore no longer
+matches the stored suffix byte-for-byte. Add a regression test that filters with
+a `DateTimeOffset` carrying a non-zero offset and asserts the matching events are
+returned. Optionally also store timestamps as Unix-epoch `INTEGER` and let SQLite
+compare numerically, eliminating the lexicographic-comparison hazard structurally.
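+
+The defect can be reproduced without SQLite, since an ordinal string comparison
+behaves like the `BINARY` collation for these ASCII timestamps. The sketch below
+is illustrative only; the sample instants are made up:
+
+```csharp
+using System;
+
+// A caller in UTC+05:00 asks for events from 12:34:56 UTC onward.
+var from = new DateTimeOffset(2026, 5, 28, 17, 34, 56, TimeSpan.FromHours(5));
+
+// An event stored at 13:00:00 UTC, which lies inside the requested range.
+string stored = new DateTimeOffset(2026, 5, 28, 13, 0, 0, TimeSpan.Zero).ToString("o");
+
+string rawFilter = from.ToString("o");                    // keeps the +05:00 offset
+string utcFilter = from.ToUniversalTime().ToString("o");  // normalised to +00:00
+
+// Emulates "timestamp >= $from" under byte-wise comparison.
+Console.WriteLine(string.CompareOrdinal(stored, rawFilter) >= 0); // False: event wrongly excluded
+Console.WriteLine(string.CompareOrdinal(stored, utcFilter) >= 0); // True
+```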
+
+### SiteEventLogging-017 — Central client's `PageSize` is unbounded; defeats the "configurable page size" design rationale
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:55`, `src/ScadaLink.Commons/Messages/RemoteQuery/EventLogQueryRequest.cs:18` |
+
+**Description**
+
+`EventLogQueryService.ExecuteQuery` resolves the effective page size as
+`var pageSize = request.PageSize > 0 ? request.PageSize : _options.QueryPageSize;`
+and uses it directly as the SQL `LIMIT $limit` (passing `pageSize + 1` to detect
+"has more"). There is no upper bound. A central client — buggy or hostile — can
+send `PageSize = int.MaxValue`, in which case the query attempts to materialise the
+entire (up to 1 GB) event log into a single `List<EventLogEntry>` while holding the
+shared write lock. This:
+
+- Builds a worst-case ~1 GB managed allocation that, depending on Akka.NET cluster
+  message serialisation limits, will then be serialised into an
+  `EventLogQueryResponse` and pushed over the ClusterClient pipe.
+- Blocks all writes (purge, recorder hot path) for the duration of the scan
+  because the read holds `_writeLock`.
+- Stalls the singleton `EventLogHandlerActor`, also blocking subsequent legitimate
+  queries.
+
+The design explicitly justifies pagination as preventing exactly this — "Results
+are paginated with a configurable page size (default: 500 events) ... This prevents
+broad queries from overwhelming the communication channel." The code honours the
+*default* but does not enforce an *upper bound* on a client-supplied override.
+
+**Recommendation**
+
+Clamp `pageSize` to a configurable maximum (e.g. `SiteEventLogOptions.MaxQueryPageSize`,
+default 5000) before using it. Also bound `KeywordFilter.Length` (e.g. 256 chars) —
+a leading-wildcard `LIKE` of an unbounded pattern is itself an expensive operation
+that runs under the same lock. Add a `Success: false, ErrorMessage: "PageSize
+exceeds maximum"` reject path so a misbehaving central is told why its query is
+refused.
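+
+A minimal sketch of the clamp, assuming a hypothetical `PageSizePolicy` helper
+and the example limits mentioned above (`MaxQueryPageSize` does not exist in the
+options class today):
+
+```csharp
+using System;
+
+public static class PageSizePolicy
+{
+    public const int DefaultPageSize = 500;    // design-doc default
+    public const int MaxQueryPageSize = 5000;  // hypothetical upper bound
+
+    // Non-positive requests fall back to the default; oversized ones are clamped.
+    public static int Resolve(int requested) =>
+        requested > 0 ? Math.Min(requested, MaxQueryPageSize) : DefaultPageSize;
+}
+
+public static class Demo
+{
+    public static void Main()
+    {
+        Console.WriteLine(PageSizePolicy.Resolve(0));            // 500
+        Console.WriteLine(PageSizePolicy.Resolve(200));          // 200
+        Console.WriteLine(PageSizePolicy.Resolve(int.MaxValue)); // 5000
+    }
+}
+```
+
+Clamping silently is the simpler option; the reject path described above is the
+stricter one and tells the caller why the request was refused.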
+
+### SiteEventLogging-018 — `FailedWriteCount` is exposed but never consumed by Health Monitoring
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:67-71,225-226` |
+
+**Description**
+
+`SiteEventLogger.FailedWriteCount` was added under SiteEventLogging-008 with the
+XML doc statement "Surfaced so Health Monitoring can detect a logging outage
+instead of relying on a local log line nobody is watching." The implementation is
+correct (`Interlocked.Increment` on write failure, `Interlocked.Read` getter), but
+a repo-wide search shows **no** caller anywhere in `src/` reads the property —
+neither `ScadaLink.HealthMonitoring`, the central health collector, nor the host's
+`/health` endpoint. The metric is unobserved: a logging outage still goes
+unnoticed in production, contradicting the original finding's resolution claim.
+
+The property is also exposed only on the concrete `SiteEventLogger`, not on
+`ISiteEventLogger`, so even if Health Monitoring were wired up it would have to
+take a concrete-type dependency (`internal Connection` removed, but
+`FailedWriteCount` remained concrete-only).
+
+**Recommendation**
+
+Either (a) wire `FailedWriteCount` into the existing Health Monitoring metric
+pipeline (e.g. publish it alongside other 30-second-interval site metrics, and
+promote a sustained non-zero value to a Warning), and add it to `ISiteEventLogger`
+so the consumer doesn't downcast; or (b) acknowledge the metric is unobserved by
+softening the XML doc to "Available for future Health Monitoring integration" and
+file a tracking item for the wiring. The current doc claim is misleading.
+
+### SiteEventLogging-019 — `EventLogPurgeService` runs on every host node; design says "active node"
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:21`, `docs/requirements/Component-SiteEventLogging.md:45` |
+
+**Description**
+
+`AddSiteEventLogging` calls `services.AddHostedService<EventLogPurgeService>()`,
+which registers the purge `BackgroundService` per host. On a 2-node site cluster
+both `node-a` and `node-b` start the service independently, so each runs its own
+30-day retention purge and 1 GB cap purge against its own local
+`site_events.db`. The design states only "A daily background job runs on the
+active node and deletes all events older than 30 days." (Component-SiteEventLogging,
+Storage section). In practice the standby node receives no writes, so its purge
+finds nothing to delete and is harmless — but the implementation does not match the
+documented "active node" gating, and the resolution note on SiteEventLogging-004
+already flagged that the *writer* runs on the standby too. The purge has the same
+shape.
+
+Aligning to the design is also a defence against a future change that does write
+to the standby (e.g. local heartbeats), and removes the per-node wake-ups that
+contribute to `Microsoft.Extensions.Hosting` shutdown latency.
+
+**Recommendation**
+
+Either (a) gate the purge service on "this node is the active member of `siteRole`"
+(check the cluster singleton ownership before each `RunPurge()`, or host the
+purge inside the same cluster singleton as `EventLogHandlerActor`), or (b) reword
+the design doc to "the purge runs on every node against its own local database;
+on the standby it is a no-op". Pick one; the current mismatch is a doc-vs-code
+defect.
+
+### SiteEventLogging-020 — `severity` and `eventType` are unvalidated free-form strings; doc enumerates a set that is not enforced
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:144-156`, `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:14-15` |
+
+**Description**
+
+`LogEventAsync` validates `eventType` and `severity` only for non-empty/non-whitespace.
+The XML doc enumerates the allowed values: `eventType` ∈ {script, alarm,
+deployment, connection, store_and_forward, instance_lifecycle}, `severity` ∈
+{Info, Warning, Error}. Nothing in the code enforces either set. Any caller can
+pass `"SCRIPT"`, `"Script"`, `"warn"`, `"ERR"`, or a typo and the row is inserted
+verbatim. Two follow-on consequences:
+
+1. The `EventLogQueryService.Severity` filter is `severity = $severity` (exact
+   match, case-sensitive by SQLite default `BINARY` collation). A row stored as
+   `"error"` will not be returned for a query filtering on `"Error"`. The design
+   lists severity as a first-class filter and the central UI will reasonably
+   normalise to one casing — every row stored with a different casing is silently
+   invisible to that filter.
+2. The `Events Logged` table in the design implicitly relies on a stable
+   `event_type` enumeration to drive UI grouping; a typo'd `event_type` slips in
+   silently and is hard to detect later.
+
+**Recommendation**
+
+Validate `eventType` and `severity` against a known set (or accept `enum`s on the
+interface, converting to canonical string at the call site). Reject unknown values
+with `ArgumentException` and log a single-shot warning during construction if a
+deployment is found to be using an unexpected value. Alternatively, normalise
+casing to one canonical spelling on both the write path and the query parameter,
+so the exact-match filter cannot miss rows that differ only in case. Update the
+XML doc to match the enforced contract.
+
+### SiteEventLogging-021 — `DateTimeOffset.Parse` uses the current culture; can throw on non-default locales
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:138` |
+
+**Description**
+
+`ExecuteQuery` materialises rows via
+`DateTimeOffset.Parse(reader.GetString(1))`. `DateTimeOffset.Parse(string)` uses
+`CultureInfo.CurrentCulture` and `DateTimeStyles.None`. The stored format is ISO
+8601 round-trip (`"o"`), which is *usually* parseable in any culture — but a
+production node running with a non-default culture (e.g. Turkish "tr-TR", which
+has historically broken case-insensitive ASCII comparisons via the
+"Turkish-I" issue, or any culture that overrides the date/time separators) can
+parse incorrectly or throw `FormatException`. The exception is caught by the outer
+`try`, so the entire query is converted to a `Success: false` response — but the
+failure mode is silent and culture-dependent.
+
+The recorder side stores via `DateTimeOffset.UtcNow.ToString("o")`. The `"o"`
+round-trip specifier produces the same ISO 8601 output regardless of the current
+culture, so the write path is not affected; the culture dependence is confined to
+the read-side `Parse` call, which is the only step that needs explicit culture
+pinning.
+
+**Recommendation**
+
+Parse with explicit invariant culture and round-trip style:
+`DateTimeOffset.Parse(reader.GetString(1), CultureInfo.InvariantCulture,
+DateTimeStyles.RoundtripKind)`. Passing `CultureInfo.InvariantCulture` to the
+`ToString("o")` emitters in `SiteEventLogger.LogEventAsync` and
+`EventLogPurgeService.PurgeByRetention` is optional: it does not change the
+output and only documents intent. Alternatively switch the schema to store
+`timestamp` as Unix-epoch `INTEGER` and avoid all string-parsing.
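+
+A minimal sketch of the culture-pinned read, using a literal in place of
+`reader.GetString(1)`:
+
+```csharp
+using System;
+using System.Globalization;
+
+// Stand-in for the value read from the timestamp column.
+string stored = "2026-05-28T12:34:56.7890123+00:00";
+
+// Invariant culture keeps the result independent of the node's current culture.
+DateTimeOffset parsed = DateTimeOffset.Parse(
+    stored, CultureInfo.InvariantCulture, DateTimeStyles.RoundtripKind);
+
+// Formatting with "o" reproduces the stored text exactly.
+Console.WriteLine(parsed.ToString("o", CultureInfo.InvariantCulture) == stored); // True
+```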
+
+### SiteEventLogging-022 — `Cache=Shared` is redundant for a single-connection logger
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:52` |
+
+**Description**
+
+The connection string is built as
+`$"Data Source={options.Value.DatabasePath};Cache=Shared"`. SQLite's
+shared-cache mode is a *cross-connection* optimisation: it lets multiple
+`SqliteConnection`s in the same process share an in-process page cache. This
+logger owns exactly one `SqliteConnection` and serialises all access through
+`_writeLock`, so `Cache=Shared` cannot share with anything — the mode is dormant.
+At best it is dead configuration; at worst it adds (very small) per-statement
+lock overhead inside SQLite. The sister `SqliteAuditWriter` carries the same
+unused option, so the smell is a copy-and-paste pattern.
+
+Shared-cache mode also subtly changes the semantics of `PRAGMA busy_timeout` and
+`PRAGMA locking_mode`, so leaving it on while *not* using it is a small future
+footgun if anyone later opens a second connection to the same file from another
+component on the same host (e.g. a tooling read-only viewer).
+
+**Recommendation**
+
+Drop `Cache=Shared` from the connection string — the logger is single-connection
+and gains nothing from it. If a future need to share the DB across connections in
+the same process arises, reintroduce it deliberately together with the busy_timeout
+and locking_mode review that should accompany it.
+
+### SiteEventLogging-023 — Concurrent-stress test uses a non-volatile `stop` flag
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.SiteEventLogging.Tests/EventLogPurgeServiceTests.cs:282-308` |
+
+**Description**
+
+`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` uses a plain `bool stop = false;`
+that the main test thread mutates after the purge task completes
+(`stop = true;`) while four background writer tasks are spin-checking `while (!stop)`.
+The flag is not declared `volatile`, not wrapped in `Volatile.Read/Volatile.Write`,
+and not behind a memory barrier. On a release build with a relaxed memory model
+the writer threads are permitted to cache the `stop = false` read indefinitely,
+which means in theory the test can hang past xUnit's per-test timeout instead of
+asserting `Empty(exceptions)`. The test relies on observed JIT/runtime behaviour
+that today happens to refresh the field across the `await _eventLogger.LogEventAsync`
+boundary, but that is an implementation detail rather than a contract.
+
+The test is a regression test for SiteEventLogging-003; a flaky / hang-prone
+version of it can mask the very behaviour it is meant to pin.
+
+**Recommendation**
+
+Use a `CancellationTokenSource` (`while (!cts.IsCancellationRequested)`), or access
+`stop` through `Volatile.Read` / `Volatile.Write`, or make it a `volatile bool`
+(if `stop` is a local it must first be hoisted to a field, since C# does not allow
+`volatile` on locals). `CancellationTokenSource` is the canonical .NET pattern and
+also lets the test bound the writers with a timeout via `CancelAfter`.
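+
+A minimal sketch of the `CancellationTokenSource` shape, with `Task.Yield` and
+`Task.Delay` standing in for the real `LogEventAsync` call and the purge under
+test (the writer count and timings are illustrative):
+
+```csharp
+using System;
+using System.Linq;
+using System.Threading;
+using System.Threading.Tasks;
+
+using var cts = new CancellationTokenSource();
+long writes = 0;
+
+// Four writers loop until cancellation is requested.
+var writers = Enumerable.Range(0, 4).Select(_ => Task.Run(async () =>
+{
+    while (!cts.IsCancellationRequested)
+    {
+        Interlocked.Increment(ref writes);
+        await Task.Yield(); // stand-in for awaiting LogEventAsync
+    }
+})).ToArray();
+
+await Task.Delay(50); // stand-in for the purge under test
+cts.Cancel();
+
+// Fails with TimeoutException instead of hanging if a writer never observes the stop signal.
+await Task.WhenAll(writers).WaitAsync(TimeSpan.FromSeconds(5));
+Console.WriteLine($"writers stopped after {Interlocked.Read(ref writes)} iterations");
+```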