code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including
first-time reviews of the four newer components (AuditLog, NotificationOutbox,
SiteCallAudit, Transport) — so the code-reviews/ index reflects today's
codebase rather than the 2026-05-16 baseline. 172 new Open findings (0
Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules.

regen-readme.py now derives each module's Last reviewed + Commit from its
findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future
single-module re-reviews show their own date in the Module Status table.
Joseph Doherty
2026-05-28 02:55:47 -04:00
parent 1eb6e972b0
commit f93b7b99bb
25 changed files with 8793 additions and 115 deletions
+381 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.SiteEventLogging` |
| Design doc | `docs/requirements/Component-SiteEventLogging.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 9 |
## Summary
@@ -46,6 +46,31 @@ keyword-search filter (SiteEventLogging-013) and a claimed initial-purge block o
host startup thread (SiteEventLogging-014 — later re-triaged to Won't Fix, the
premise does not hold on .NET 8+).
#### Re-review 2026-05-28 (commit `1eb6e97`)
Re-reviewed the module at commit `1eb6e97`. All fourteen prior findings remain closed
and their resolutions hold up under inspection: the lock-guarded `WithConnection`
overloads, the background-writer `Channel<T>` with disposed-mid-drain fault
propagation, the `auto_vacuum = INCREMENTAL` schema + logical-size measurement, the
severity index, the `LIKE` keyword-search escaping, and the concrete-recorder DI
wiring are all present and correct at this commit. Nine new findings were recorded —
none are regressions of prior fixes. The most notable (SiteEventLogging-016, **High**)
is a correctness defect in the query path: timestamps are stored as ISO 8601 strings
generated from `DateTimeOffset.UtcNow` (so they always have a `+00:00` offset suffix),
but the `From`/`To` filters are stringified verbatim via `request.From.Value.ToString("o")`
without normalising to UTC, so a central client that sends a non-UTC `DateTimeOffset`
gets a broken lexicographic comparison and either spuriously includes or excludes
events. The next-most-notable findings are SiteEventLogging-015 (unbounded background
write queue can grow without limit under sustained writer slowness — sister
`SqliteAuditWriter` uses a bounded channel) and SiteEventLogging-017 (the central
client's `PageSize` is used verbatim with no upper-bound clamp, defeating the design's
"prevents broad queries from overwhelming the communication channel" rationale). The
remaining findings are low-severity hygiene / documentation: an unused
`FailedWriteCount` metric, untyped severity/event-type fields, non-invariant culture
parsing, the purge service running on the standby node, the redundant `Cache=Shared`
on a single-connection logger, and a non-volatile stop flag in a concurrency stress
test.
## Checklist coverage
| # | Category | Examined | Notes |
@@ -61,6 +86,21 @@ premise does not hold on .NET 8+).
| 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
| 10 | Documentation & comments | ☑ | `LogEventAsync` XML doc says "asynchronously" but is synchronous (-009); stale "Phase 4+" placeholder (-011). |
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps (-016); `DateTimeOffset.Parse` without invariant culture is culture-sensitive (-021); severity/event-type accept any non-empty string with no schema enforcement (-020). |
| 2 | Akka.NET conventions | ☑ | `EventLogHandlerActor` is a simple `Receive`/`Tell` bridge with no supervision concerns of its own; no new findings. |
| 3 | Concurrency & thread safety | ☑ | Concurrent-write stress test uses a non-volatile `stop` flag (-023). The shared-connection lock pattern is correct post-SiteEventLogging-003. |
| 4 | Error handling & resilience | ☑ | `FailedWriteCount` is exposed but nothing in Health Monitoring polls it — the metric is unobserved (-018). |
| 5 | Security | ☑ | Queries are fully parameterised. `PageSize` and `KeywordFilter` from the central client are not bounded (-017) — a hostile or buggy central could request `int.MaxValue` rows or multi-MB `LIKE` patterns. |
| 6 | Performance & resource management | ☑ | Background write queue is unbounded (-015); `Cache=Shared` is redundant for a single-connection logger (-022); upper-bound on `PageSize` missing (-017). |
| 7 | Design-document adherence | ☑ | `EventLogPurgeService` is registered as a per-host `BackgroundService` and runs on the standby too, but the design says "the daily background job runs on the active node" (-019). |
| 8 | Code organization & conventions | ☑ | `FailedWriteCount` is on the concrete `SiteEventLogger`, not on `ISiteEventLogger`, so any future non-concrete consumer cannot read it (-018). |
| 9 | Testing coverage | ☑ | Non-volatile `stop` flag in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` (-023). No tests for `PageSize` bounds, `From`/`To` timezone handling, or unobserved `FailedWriteCount`. |
| 10 | Documentation & comments | ☑ | `FailedWriteCount` XML doc claims "Health Monitoring can poll" but nothing does (-018). Severity / event-type docs enumerate values that are not enforced (-020). |
## Findings
### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space
@@ -706,3 +746,341 @@ re-triage note). No code change made. A verification test
`StartAsync_DoesNotBlock_OnTheInitialPurge` was added to pin this behaviour
(asserts `StartAsync` returns in under 1 s and the initial purge still runs on the
background scheduler).
### SiteEventLogging-015 — Background write queue is unbounded; can grow without limit under sustained writer slowness
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:58-63` |
**Description**
`SiteEventLogger` creates its background-writer feeder as
`Channel.CreateUnbounded<PendingEvent>(...)`. The writer thread funnels every write
through the shared `_writeLock` (acquired by `WithConnection`), so any condition that
makes a single iteration slow — a long-running query in `EventLogQueryService`
holding the lock, a `PurgeByStorageCap` run that takes the lock for batched
`DELETE` + `PRAGMA incremental_vacuum`, a disk stall, or a sustained event burst
from an alarm storm / script failure loop — drives the queue arbitrarily large.
Every queued `PendingEvent` retains its `TaskCompletionSource` and its payload
strings, so there is no upper bound on how much memory the recorder can hold.
The sister centralized-audit component `ScadaLink.AuditLog/Site/SqliteAuditWriter.cs`
addresses the same hot-path-writer problem with
`Channel.CreateBounded<...>(new BoundedChannelOptions(_options.ChannelCapacity) { ..., FullMode = BoundedChannelFullMode.Wait })`,
giving back-pressure to producers. Site event logging picked the riskier choice for
a component that — per the design — is fed by every site subsystem (script, alarm,
deployment, DCL, store-and-forward, instance lifecycle, notification) and has both
a 30-day retention sweep and a 1 GB cap-purge competing for the same lock.
**Recommendation**
Switch to `Channel.CreateBounded<PendingEvent>(...)` with a configurable capacity
(default on the order of 10 000: large enough to absorb a normal alarm burst,
small enough to bound memory). Pick a `FullMode` that matches policy: `Wait` for
back-pressure (callers `await` and serialise their actor thread on the queue, which
defeats some of the SiteEventLogging-005 win but is safe), or `DropOldest` /
`DropWrite` with a counter (drop-and-count is closer to "best-effort audit"). Add
the dropped-event counter to `FailedWriteCount` or a sibling metric. Document the
chosen policy on `ISiteEventLogger.LogEventAsync`.
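
A minimal sketch of the drop-and-count variant, assuming `DropWrite` is the chosen
policy. `PendingEventSketch` and `BoundedEventQueueSketch` are hypothetical names
used only for illustration; they are not the module's real types, and the capacity
would come from options in practice.

```csharp
using System.Threading;
using System.Threading.Channels;

// Hypothetical stand-in for the logger's PendingEvent type.
public sealed record PendingEventSketch(string EventType, string Severity, string Message);

public sealed class BoundedEventQueueSketch
{
    private readonly Channel<PendingEventSketch> _channel;
    private long _droppedCount;

    public BoundedEventQueueSketch(int capacity)
    {
        var options = new BoundedChannelOptions(capacity)
        {
            SingleReader = true,   // one background writer drains the queue
            SingleWriter = false,  // many subsystems produce events
            FullMode = BoundedChannelFullMode.DropWrite,
        };

        // The itemDropped callback runs for every event discarded because the
        // queue is full, which is where the drop counter is maintained.
        _channel = Channel.CreateBounded<PendingEventSketch>(
            options,
            _ => Interlocked.Increment(ref _droppedCount));
    }

    // Number of events discarded so far; a candidate sibling to FailedWriteCount.
    public long DroppedCount => Interlocked.Read(ref _droppedCount);

    // In DropWrite mode TryWrite reports success even when the event is dropped,
    // so the counter, not the return value, is the signal to observe.
    public bool TryEnqueue(PendingEventSketch evt) => _channel.Writer.TryWrite(evt);

    public ChannelReader<PendingEventSketch> Reader => _channel.Reader;
}
```

With `FullMode = Wait` the producer would instead call `Writer.WriteAsync` and the
callback would never fire; which of the two fits depends on whether losing an event
or delaying an actor is the lesser problem.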
### SiteEventLogging-016 — `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:67-77`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:159`, `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:72-78` |
**Description**
Event rows are persisted with `timestamp` = `DateTimeOffset.UtcNow.ToString("o")`,
which always emits the round-trip ISO 8601 form ending in the literal offset
`+00:00` (e.g. `2026-05-28T12:34:56.7890123+00:00`). The query path filters by
range using a direct string comparison:
```
whereClauses.Add("timestamp >= $from");
parameters.Add(new SqliteParameter("$from", request.From.Value.ToString("o")));
```
`request.From` is a `DateTimeOffset?` and `ToString("o")` preserves whatever offset
the caller passed in. If a central client passes a non-UTC `DateTimeOffset` — for
example the result of `DateTimeOffset.Now` in a `UTC+05:00` timezone — the produced
string is `"2026-05-28T17:34:56.0000000+05:00"`, which is lexicographically *greater*
than the equivalent UTC instant string `"2026-05-28T12:34:56.0000000+00:00"`. The
comparison `timestamp >= $from` is then evaluated as a byte-by-byte string compare
(SQLite default `BINARY` collation), so the query either spuriously excludes events
that genuinely occurred in the range, or spuriously includes events from a wholly
different hour. The same defect applies to `To`. The retention purge does
`DateTimeOffset.UtcNow.AddDays(-N).ToString("o")` (UTC) so it is safe; only the
central query path is vulnerable.
The design explicitly states "All timestamps are UTC throughout the system" but the
boundary between a central `DateTimeOffset` and the SQLite store is not enforced.
A central UI rendered in a non-UTC timezone is the most likely trigger, and the
defect silently corrupts every query that filters by time range — exactly the
filter most likely to be set on a "show me what happened around the failover" query.
**Recommendation**
Normalise `From` / `To` to UTC before serialising:
`request.From.Value.ToUniversalTime().ToString("o")`, so the produced offset is
always `+00:00`. Avoid `.UtcDateTime.ToString("o")` here: it formats a `DateTime`
of kind `Utc`, whose round-trip form ends in `Z` rather than `+00:00`, so it no
longer matches the stored suffix byte-for-byte. Add a regression test that filters
with a `DateTimeOffset` carrying a non-zero offset and asserts the matching events
are returned. Optionally also store timestamps as Unix-epoch `INTEGER` and let
SQLite compare numerically, eliminating the lexicographic-comparison hazard
structurally.
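
The failure mode and the fix can be reproduced without SQLite, since an ordinal
string comparison behaves like the `BINARY` collation for these ASCII strings. The
following is a sketch under that assumption; `TimestampFilterSketch` and its
members are hypothetical helpers, not code from the module.

```csharp
using System;
using System.Globalization;

public static class TimestampFilterSketch
{
    // Serialise a filter bound in the same shape as the stored column:
    // always UTC, always ending in "+00:00".
    public static string ToStoredForm(DateTimeOffset value) =>
        value.ToUniversalTime().ToString("o", CultureInfo.InvariantCulture);

    // Stand-in for the SQL predicate "timestamp >= $from" under BINARY collation.
    public static bool StoredIsAtOrAfter(string stored, string bound) =>
        string.CompareOrdinal(stored, bound) >= 0;
}
```

For an event stored as `2026-05-28T12:40:00.0000000+00:00` and a `From` of
17:34:56 at `+05:00` (12:34:56 UTC), the verbatim bound
`2026-05-28T17:34:56.0000000+05:00` compares greater than the stored value, so the
event is wrongly excluded. The normalised bound
`2026-05-28T12:34:56.0000000+00:00` compares lower and the event is returned.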
### SiteEventLogging-017 — Central client's `PageSize` is unbounded; defeats the "configurable page size" design rationale
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:55`, `src/ScadaLink.Commons/Messages/RemoteQuery/EventLogQueryRequest.cs:18` |
**Description**
`EventLogQueryService.ExecuteQuery` resolves the effective page size as
`var pageSize = request.PageSize > 0 ? request.PageSize : _options.QueryPageSize;`
and uses it directly as the SQL `LIMIT $limit` (passing `pageSize + 1` to detect
"has more"). There is no upper bound. A central client — buggy or hostile — can
send `PageSize = int.MaxValue`, in which case the query attempts to materialise the
entire (up to 1 GB) event log into a single `List<EventLogEntry>` while holding the
shared write lock. This:
- Builds a worst-case ~1 GB managed allocation that, depending on Akka.NET cluster
message serialisation limits, will then be serialised into an
`EventLogQueryResponse` and pushed over the ClusterClient pipe.
- Blocks all writes (purge, recorder hot path) for the duration of the scan
because the read holds `_writeLock`.
- Stalls the singleton `EventLogHandlerActor`, also blocking subsequent legitimate
queries.
The design explicitly justifies pagination as preventing exactly this — "Results
are paginated with a configurable page size (default: 500 events) ... This prevents
broad queries from overwhelming the communication channel." The code honours the
*default* but does not enforce an *upper bound* on a client-supplied override.
**Recommendation**
Clamp `pageSize` to a configurable maximum (e.g. `SiteEventLogOptions.MaxQueryPageSize`,
default 5000) before using it. Also bound `KeywordFilter.Length` (e.g. 256 chars) —
a leading-wildcard `LIKE` of an unbounded pattern is itself an expensive operation
that runs under the same lock. Add a `Success: false, ErrorMessage: "PageSize
exceeds maximum"` reject path so a misbehaving central is told why its query is
refused.
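
A sketch of the clamp, assuming a maximum is added to the options.
`PageSizePolicySketch` is a hypothetical helper and the limits shown in the usage
note are the illustrative values from this recommendation, not existing
configuration.

```csharp
using System;

public static class PageSizePolicySketch
{
    // Resolve the effective page size: fall back to the default when the client
    // sends nothing usable, then cap the result at the configured maximum.
    public static int Resolve(int requested, int defaultSize, int maxSize)
    {
        if (defaultSize <= 0) throw new ArgumentOutOfRangeException(nameof(defaultSize));
        if (maxSize < defaultSize) throw new ArgumentOutOfRangeException(nameof(maxSize));

        var effective = requested > 0 ? requested : defaultSize;

        // Capping here also keeps a later "pageSize + 1" has-more probe far from
        // int.MaxValue, so the addition cannot overflow.
        return Math.Min(effective, maxSize);
    }

    // Bound the keyword filter length before it is turned into a LIKE pattern.
    public static bool IsKeywordAcceptable(string keyword, int maxLength) =>
        string.IsNullOrEmpty(keyword) || keyword.Length <= maxLength;
}
```

`Resolve(int.MaxValue, 500, 5000)` yields 5000 and `Resolve(0, 500, 5000)` yields
500. Whether an oversized request is clamped silently or rejected with an error
response is a policy choice the sketch leaves open.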
### SiteEventLogging-018 — `FailedWriteCount` is exposed but never consumed by Health Monitoring
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:67-71,225-226` |
**Description**
`SiteEventLogger.FailedWriteCount` was added under SiteEventLogging-008 with the
XML doc statement "Surfaced so Health Monitoring can detect a logging outage
instead of relying on a local log line nobody is watching." The implementation is
correct (`Interlocked.Increment` on write failure, `Interlocked.Read` getter), but
a repo-wide search shows **no** caller anywhere in `src/` reads the property —
neither `ScadaLink.HealthMonitoring`, the central health collector, nor the host's
`/health` endpoint. The metric is effectively dead: a logging outage still goes
unnoticed in production, contradicting the original finding's resolution claim.
The property is also exposed only on the concrete `SiteEventLogger`, not on
`ISiteEventLogger`, so even if Health Monitoring were wired up it would have to
take a concrete-type dependency (the `internal Connection` member was removed, but
`FailedWriteCount` remained concrete-only).
**Recommendation**
Either (a) wire `FailedWriteCount` into the existing Health Monitoring metric
pipeline (e.g. publish it alongside other 30-second-interval site metrics, and
promote a sustained non-zero value to a Warning), and add it to `ISiteEventLogger`
so the consumer doesn't downcast; or (b) acknowledge the metric is unobserved by
softening the XML doc to "Available for future Health Monitoring integration" and
file a tracking item for the wiring. The current doc claim is misleading.
### SiteEventLogging-019 — `EventLogPurgeService` runs on every host node; design says "active node"
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:21`, `docs/requirements/Component-SiteEventLogging.md:45` |
**Description**
`AddSiteEventLogging` calls `services.AddHostedService<EventLogPurgeService>()`,
which registers the purge `BackgroundService` per host. On a 2-node site cluster
both `node-a` and `node-b` start the service independently, so each runs its own
30-day retention purge and 1 GB cap purge against its own local
`site_events.db`. The design states only "A daily background job runs on the
active node and deletes all events older than 30 days." (Component-SiteEventLogging,
Storage section). In practice the standby node receives no writes, so its purge
finds nothing to delete and is harmless — but the implementation does not match the
documented "active node" gating, and the resolution note on SiteEventLogging-004
already flagged that the *writer* runs on the standby too. The purge has the same
shape.
Aligning to the design is also a defence against a future change that does write
to the standby (e.g. local heartbeats), and removes the per-node wake-ups that
contribute to `Microsoft.Extensions.Hosting` shutdown latency.
**Recommendation**
Either (a) gate the purge service on "this node is the active member of `siteRole`"
(check the cluster singleton ownership before each `RunPurge()`, or host the
purge inside the same cluster singleton as `EventLogHandlerActor`), or (b) reword
the design doc to "the purge runs on every node against its own local database;
on the standby it is a no-op". Pick one; the current mismatch is a doc-vs-code
defect.
### SiteEventLogging-020 — `severity` and `eventType` are unvalidated free-form strings; doc enumerates a set that is not enforced
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:144-156`, `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:14-15` |
**Description**
`LogEventAsync` validates `eventType` and `severity` only for non-empty/non-whitespace.
The XML doc enumerates the allowed values: `eventType` ∈ {script, alarm,
deployment, connection, store_and_forward, instance_lifecycle}, `severity` ∈
{Info, Warning, Error}. Nothing in the code enforces either set. Any caller can
pass `"SCRIPT"`, `"Script"`, `"warn"`, `"ERR"`, or a typo and the row is inserted
verbatim. Two follow-on consequences:
1. The `EventLogQueryService.Severity` filter is `severity = $severity` (exact
match, case-sensitive by SQLite default `BINARY` collation). A row stored as
`"error"` will not be returned for a query filtering on `"Error"`. The design
lists severity as a first-class filter and the central UI will reasonably
normalise to one casing — every row stored with a different casing is silently
invisible to that filter.
2. The `Events Logged` table in the design implicitly relies on a stable
`event_type` enumeration to drive UI grouping; a typo'd `event_type` slips in
silently and is hard to detect later.
**Recommendation**
Validate `eventType` and `severity` against a known set (or accept `enum`s on the
interface, converting to canonical string at the call site). Reject unknown values
with `ArgumentException` and log a single-shot warning during construction if a
deployment is found to be using an unexpected value. Alternatively, normalise
casing (`severity = severity.ToLowerInvariant()`) on both the write path and the
query filter, so the comparison is effectively case-insensitive; normalising only
one side would leave the mismatch in place. Update the XML doc to match the
enforced contract.
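
One possible shape for the validation, assuming the three severities listed in the
XML doc are the full set. `EventFieldValidationSketch` is a hypothetical helper;
the same pattern would apply to `eventType`.

```csharp
using System;
using System.Collections.Generic;

public static class EventFieldValidationSketch
{
    // Case-insensitive lookup that maps any accepted spelling to the canonical one.
    private static readonly Dictionary<string, string> Severities =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["Info"] = "Info",
            ["Warning"] = "Warning",
            ["Error"] = "Error",
        };

    // Returns the canonical severity, or throws for anything outside the set.
    public static string CanonicalSeverity(string severity)
    {
        if (severity is not null && Severities.TryGetValue(severity.Trim(), out var canonical))
            return canonical;

        throw new ArgumentException($"Unknown severity '{severity}'.", nameof(severity));
    }
}
```

Running both the write path and the query filter through the same function keeps
the exact-match `severity = $severity` predicate working: `"error"` and `"ERROR"`
are stored and queried as `"Error"`, while `"warn"` is rejected outright.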
### SiteEventLogging-021 — `DateTimeOffset.Parse` uses the current culture; can throw on non-default locales
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:138` |
**Description**
`ExecuteQuery` materialises rows via
`DateTimeOffset.Parse(reader.GetString(1))`. `DateTimeOffset.Parse(string)` uses
`CultureInfo.CurrentCulture` and `DateTimeStyles.None`. The stored format is ISO
8601 round-trip (`"o"`), which is *usually* parseable in any culture — but a
production node running with a non-default culture (e.g. Turkish "tr-TR", which
has historically broken case-insensitive ASCII comparisons via the
"Turkish-I" issue, or any culture that overrides the date/time separators) can
parse incorrectly or throw `FormatException`. The exception is caught by the outer
`try`, so the entire query is converted to a `Success: false` response — but the
failure mode is silent and culture-dependent.
The recorder side stores via `DateTimeOffset.UtcNow.ToString("o")`. The `"o"`
round-trip specifier is defined against the invariant culture, so the emitted
string does not vary with the node's culture; the culture dependence is on the
parse side only. Pinning the culture explicitly on both sides still makes the
round-trip contract visible in the code.
**Recommendation**
Parse with explicit invariant culture and round-trip style:
`DateTimeOffset.Parse(reader.GetString(1), CultureInfo.InvariantCulture,
DateTimeStyles.RoundtripKind)` (and the same for the `ToString("o", InvariantCulture)`
emitters in `SiteEventLogger.LogEventAsync` and `EventLogPurgeService.PurgeByRetention`).
Alternatively switch the schema to store `timestamp` as Unix-epoch `INTEGER` and
avoid all string-parsing.
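
A sketch of the pinned round trip, using the overloads named in this
recommendation. `TimestampRoundTripSketch` is a hypothetical helper, not a type in
the module.

```csharp
using System;
using System.Globalization;

public static class TimestampRoundTripSketch
{
    // Emit the stored form with the culture stated explicitly.
    public static string Format(DateTimeOffset value) =>
        value.ToString("o", CultureInfo.InvariantCulture);

    // Parse the stored form without consulting CultureInfo.CurrentCulture.
    public static DateTimeOffset Parse(string stored) =>
        DateTimeOffset.Parse(stored, CultureInfo.InvariantCulture, DateTimeStyles.RoundtripKind);
}
```

`Parse(Format(t))` returns a value equal to `t` down to the tick, independent of
the thread's current culture, because neither call reads it.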
### SiteEventLogging-022 — `Cache=Shared` is redundant for a single-connection logger
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:52` |
**Description**
The connection string is built as
`$"Data Source={options.Value.DatabasePath};Cache=Shared"`. SQLite's
shared-cache mode is a *cross-connection* optimisation: it lets multiple
`SqliteConnection`s in the same process share an in-process page cache. This
logger owns exactly one `SqliteConnection` and serialises all access through
`_writeLock`, so `Cache=Shared` cannot share with anything — the mode is dormant.
At best it is dead configuration; at worst it adds (very small) per-statement
lock overhead inside SQLite. The sister `SqliteAuditWriter` carries the same
unused option, so the smell is a copy-and-paste pattern.
Shared-cache mode also subtly changes the semantics of `PRAGMA busy_timeout` and
`PRAGMA locking_mode`, so leaving it on while *not* using it is a small future
footgun if anyone later opens a second connection to the same file from another
component on the same host (e.g. a tooling read-only viewer).
**Recommendation**
Drop `Cache=Shared` from the connection string — the logger is single-connection
and gains nothing from it. If a future need to share the DB across connections in
the same process arises, reintroduce it deliberately together with the busy_timeout
and locking_mode review that should accompany it.
### SiteEventLogging-023 — Concurrent-stress test uses a non-volatile `stop` flag
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.SiteEventLogging.Tests/EventLogPurgeServiceTests.cs:282-308` |
**Description**
`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` uses a plain `bool stop = false;`
that the main test thread mutates after the purge task completes
(`stop = true;`) while four background writer tasks are spin-checking `while (!stop)`.
The flag is not declared `volatile`, not wrapped in `Volatile.Read/Volatile.Write`,
and not behind a memory barrier. On a release build with a relaxed memory model
the writer threads are permitted to cache the `stop = false` read indefinitely,
which means in theory the test can hang past xUnit's per-test timeout instead of
asserting `Empty(exceptions)`. The test relies on observed JIT/runtime behaviour
that today happens to refresh the field across the `await _eventLogger.LogEventAsync`
boundary, but that is an implementation detail rather than a contract.
The test is a regression test for SiteEventLogging-003; a flaky / hang-prone
version of it can mask the very behaviour it is meant to pin.
**Recommendation**
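Use a `CancellationTokenSource` (`while (!cts.IsCancellationRequested)`), or access
`stop` through `Volatile.Read` / `Volatile.Write`. Declaring it `volatile bool`
only works if `stop` is first hoisted to a field, because C# does not allow
`volatile` on a local. `CancellationTokenSource` is the canonical .NET pattern and
also lets the test bound the writers' lifetime with a timeout (`CancelAfter`)
instead of relying on the test runner to kill a hung test.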
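
The `CancellationTokenSource` variant could look roughly like this. The sketch is
deliberately independent of the real test: `CancellableWritersSketch`, `writeOnce`
and `contendingWork` are hypothetical stand-ins for the writer loop, the
`LogEventAsync` call and the purge run.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class CancellableWritersSketch
{
    // Runs writerCount background loops until contendingWork finishes,
    // then signals them to stop and waits for all of them to exit.
    public static async Task<long> RunAsync(
        int writerCount, Func<Task> writeOnce, Func<Task> contendingWork)
    {
        using var cts = new CancellationTokenSource();
        long writes = 0;

        var writers = Enumerable.Range(0, writerCount)
            .Select(_ => Task.Run(async () =>
            {
                // IsCancellationRequested is safe to poll from any thread,
                // unlike a plain captured bool.
                while (!cts.IsCancellationRequested)
                {
                    await writeOnce();
                    Interlocked.Increment(ref writes);
                }
            }))
            .ToArray();

        try
        {
            await contendingWork();
        }
        finally
        {
            // Always release the writers, even if the contending work throws.
            cts.Cancel();
        }

        await Task.WhenAll(writers);
        return Interlocked.Read(ref writes);
    }
}
```

Cancelling in a `finally` block means a failing purge cannot leave the writer
loops spinning, which is the hang the finding is concerned about.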