docs(code-reviews): re-review batch 4 at 39d737e — SiteEventLogging, SiteRuntime, StoreAndForward, TemplateEngine

11 new findings: SiteEventLogging-012..014, SiteRuntime-017..019, StoreAndForward-015..017, TemplateEngine-015..016.
Joseph Doherty
2026-05-17 00:51:58 -04:00
parent 3b3760f026
commit 0ba4e49e11
5 changed files with 613 additions and 27 deletions


@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.SiteEventLogging` |
| Design doc | `docs/requirements/Component-SiteEventLogging.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 3 |
## Summary
@@ -28,16 +28,33 @@ cluster-singleton placement of the handler actor (which can pin to the standby
node), missing indexes for common query filters, retention/cap purge not enforcing
the requirement strictly, and several documentation/maintainability issues.
#### Re-review 2026-05-17 (commit `39d737e`)
Re-reviewed the module at commit `39d737e`. All eleven prior findings remain closed
(SiteEventLogging-001..003, 005..011 Resolved; 004 Won't Fix) and the resolutions
hold up under inspection — the background writer, lock-guarded `WithConnection`,
`auto_vacuum = INCREMENTAL` plus logical-size measurement, the severity index, and
the concrete-recorder DI wiring are all present and correct at this commit. The
module source is byte-identical between `39d737e` and current `HEAD`, so this review
reflects the live code. Three new findings were recorded (one Medium, two Low), none
of them regressions of prior fixes. The most notable (SiteEventLogging-012) is a correctness
gap left by the SiteEventLogging-005 background-writer rework: when an event cannot
be persisted because the logger has been disposed, the returned `Task` is completed
*successfully* rather than faulted, so an `await`-ing caller is told a dropped audit
event was written. The other two are minor: unescaped SQL `LIKE` wildcards in the
keyword-search filter (SiteEventLogging-013) and the initial purge running
synchronously on the host startup thread (SiteEventLogging-014).
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). |
| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). Re-review: dropped events report success (-012); `LIKE` wildcards unescaped in keyword search (-013). |
| 2 | Akka.NET conventions | ☑ | Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). `Ask` boundary is appropriate. |
| 3 | Concurrency & thread safety | ☑ | Shared `SqliteConnection` used by purge/query without the write lock (-003). |
| 4 | Error handling & resilience | ☑ | `LogEventAsync` swallows write failures silently into a log line only (-008); purge catches broadly. |
| 5 | Security | ☑ | Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding. |
| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). |
| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). Re-review: initial purge blocks host startup thread (-014). |
| 7 | Design-document adherence | ☑ | Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002). |
| 8 | Code organization & conventions | ☑ | Concrete-type downcast of `ISiteEventLogger` (-007); `internal Connection` leaks DB handle (-007). |
| 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
@@ -529,3 +546,122 @@ explanatory note added to `AddSiteEventLogging` pointing readers to where the ac
is actually registered. Documentation/dead-code change only; no regression test was
added — the change is a method removal verified by the compiler (no callers) and the
full module suite still passing.
### SiteEventLogging-012 — Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:160-166,193-197` |
**Description**
`LogEventAsync` returns a `Task` that, per the interface XML doc (corrected under
SiteEventLogging-009), "completes once the event is durably persisted and faults if
the write fails, so callers that `await` it observe success or failure." Two paths
break that contract by signalling **success** for an event that was never written:
1. In `LogEventAsync`, if `_writeQueue.Writer.TryWrite(pending)` fails (the channel
has been completed because the logger was disposed), the code calls
`pending.Completion.TrySetResult()` — completing the `Task` successfully — even
though the comment immediately above acknowledges "there is nowhere to persist the
event."
2. In `ProcessWriteQueueAsync`, `WithConnection` returns `false` when the logger has
been disposed mid-drain. The code does not inspect the returned `written` flag and
unconditionally calls `pending.Completion.TrySetResult()`, again reporting success
for an event the comment admits "simply cannot be persisted."
The event log is the site's diagnostic audit trail. A caller that `await`s
`LogEventAsync` to confirm a critical event (deployment applied, alarm activated) was
recorded will observe a *successful* completion for an event that was silently
dropped. This is the same class of defect SiteEventLogging-008 fixed for write
*errors* — but the disposed-drop path was left reporting false success. The window
is the disposal/shutdown interval, during which shutdown-related events (graceful
singleton handover, instance disable) are exactly the events most likely to be
enqueued and lost.
**Recommendation**
For both paths, fault the `Task` (or complete it with a sentinel failure) instead of
`TrySetResult()` — e.g. `pending.Completion.TrySetException(new ObjectDisposedException(...))`
— so an `await`-ing caller can distinguish a dropped event from a persisted one.
Inspect the `written` flag returned by `WithConnection` in `ProcessWriteQueueAsync`
and only call `TrySetResult()` when `written` is `true`. If deliberate "drop silently
on shutdown" semantics are chosen instead, update the XML doc to say so.
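
The recommended contract can be illustrated with a small sketch. The block below is a
Python `asyncio` analogue, not the module's C# code: `EventLogger`, `log_event`, and
`LoggerDisposedError` are hypothetical stand-ins for `SiteEventLogger`, `LogEventAsync`,
and `ObjectDisposedException`, and the in-memory list stands in for the SQLite store.
It assumes only what this finding describes: a write that returns `False` after
disposal should fault the awaited completion rather than complete it successfully.

```python
import asyncio


class LoggerDisposedError(Exception):
    """Hypothetical stand-in for .NET's ObjectDisposedException."""


class EventLogger:
    """Toy analogue of a queue-backed event logger (not the real SiteEventLogger)."""

    def __init__(self):
        self._rows = []          # stands in for the SQLite table
        self._disposed = False

    def dispose(self):
        self._disposed = True

    def _try_write(self, event):
        # Mirrors a write helper that returns False once the logger is disposed.
        if self._disposed:
            return False
        self._rows.append(event)
        return True

    def log_event(self, event):
        # Returns a future that succeeds only if the event was actually stored.
        completion = asyncio.get_running_loop().create_future()
        if self._try_write(event):
            completion.set_result(None)
        else:
            # Fault instead of reporting success for a dropped event.
            completion.set_exception(
                LoggerDisposedError("event dropped: logger disposed")
            )
        return completion


async def demo():
    logger = EventLogger()
    await logger.log_event("deployment applied")  # persisted, completes normally
    logger.dispose()
    try:
        await logger.log_event("instance disabled")
        return "reported success"
    except LoggerDisposedError:
        return "reported failure"


outcome = asyncio.run(demo())
```

In this sketch the second `await` raises, so the caller can tell a dropped event from a
persisted one. If the `else` branch called `set_result(None)` instead, the sketch would
reproduce the false success this finding describes.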
**Resolution**
_Unresolved._
### SiteEventLogging-013 — Keyword search does not escape SQL `LIKE` wildcards in user input
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:79-83` |
**Description**
The keyword-search filter builds the `LIKE` pattern as `$"%{request.KeywordFilter}%"`
and binds it as a parameter. Parameterisation correctly prevents SQL injection, but
it does **not** neutralise the `LIKE` metacharacters `%` and `_` inside the
user-supplied keyword. A search for a literal `_` (common in event sources and
identifiers such as `store_and_forward`, `PLC_1`, or instance IDs) is interpreted as
"match any single character", and a `%` matches any run of characters. The design
calls keyword search "free-text search on message and source fields ... Useful for
finding events by script name, alarm name, or error message" — users will reasonably
expect a literal substring match, so a query for `store_and_forward` silently returns
events containing `storeXandYforward` and similar false positives. There is no way
for the caller to search for a literal underscore or percent.
**Recommendation**
Escape `%`, `_`, and the escape character itself in `request.KeywordFilter` before
wrapping it in `%...%`, and append an `ESCAPE` clause to the `LIKE` expression
(e.g. `... LIKE $keyword ESCAPE '\'`). Alternatively document that the keyword field
accepts `LIKE` wildcard syntax, but a literal-substring match is the behaviour the
design implies.
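
The difference is easy to reproduce against SQLite directly. The following is a minimal
sketch using Python's built-in `sqlite3` module rather than the module's C# code; the
`site_events` table, its `message` column, and the `escape_like` helper are hypothetical
names chosen for illustration, and the backslash escape character matches the one
suggested above.

```python
import sqlite3


def escape_like(keyword, escape="\\"):
    """Escape LIKE metacharacters so the keyword matches as a literal substring."""
    # Escape the escape character first so the later replacements are not doubled.
    return (
        keyword.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE site_events (message TEXT)")  # hypothetical schema
conn.executemany(
    "INSERT INTO site_events (message) VALUES (?)",
    [("store_and_forward retry",), ("storeXandYforward retry",)],
)

keyword = "store_and_forward"

# Unescaped: each '_' acts as a single-character wildcard, so both rows match.
unescaped = conn.execute(
    "SELECT message FROM site_events WHERE message LIKE ?",
    (f"%{keyword}%",),
).fetchall()

# Escaped, with an ESCAPE clause: only the literal substring matches.
escaped = conn.execute(
    "SELECT message FROM site_events WHERE message LIKE ? ESCAPE '\\'",
    (f"%{escape_like(keyword)}%",),
).fetchall()
```

Here `unescaped` contains both rows while `escaped` contains only the row with the
literal `store_and_forward`. The value stays bound as a parameter in both queries, so
the injection protection noted above is unaffected by the escaping.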
**Resolution**
_Unresolved._
### SiteEventLogging-014 — Initial purge runs synchronously on the host startup thread
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:34-48` |
**Description**
`EventLogPurgeService.ExecuteAsync` calls `RunPurge()` (a fully synchronous method
that runs `PurgeByRetention` and `PurgeByStorageCap`) *before* the first `await`
(`await timer.WaitForNextTickAsync(...)`). A `BackgroundService`'s `ExecuteAsync` is
invoked from `StartAsync`, and the host's startup pipeline does not proceed past a
`BackgroundService` until its `ExecuteAsync` yields at the first real `await`. Because
`RunPurge()` precedes any `await`, the entire initial purge — including a cap-purge
that deletes rows in 1000-row batches and runs `PRAGMA incremental_vacuum` until a
near-1 GB database is back under the cap — executes inline on the startup thread,
blocking host startup (and therefore the `/health/ready` gate) for as long as the
purge takes. On a site that has accumulated a large log this can be a multi-second
stall during every node start/failover. The class doc states the service "runs on a
background thread and does not block event recording" — the startup-thread block is
inconsistent with that intent.
**Recommendation**
Yield before the initial purge so it runs on the background scheduler rather than the
startup thread — e.g. `await Task.Yield();` as the first statement of `ExecuteAsync`,
or move the initial `RunPurge()` to after the first `await timer.WaitForNextTickAsync`
(accepting a one-interval delay), or offload it with `await Task.Run(RunPurge, stoppingToken)`.
**Resolution**
_Unresolved._