docs(code-reviews): re-review batch 4 at 39d737e — SiteEventLogging, SiteRuntime, StoreAndForward, TemplateEngine
11 new findings: SiteEventLogging-012..014, SiteRuntime-017..019, StoreAndForward-015..017, TemplateEngine-015..016.
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.SiteEventLogging` |
| Design doc | `docs/requirements/Component-SiteEventLogging.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 3 |
## Summary
@@ -28,16 +28,33 @@ cluster-singleton placement of the handler actor (which can pin to the standby
node), missing indexes for common query filters, retention/cap purge not enforcing
the requirement strictly, and several documentation/maintainability issues.
#### Re-review 2026-05-17 (commit `39d737e`)
Re-reviewed the module at commit `39d737e`. All eleven prior findings remain closed
(SiteEventLogging-001..003, 005..011 Resolved; 004 Won't Fix) and the resolutions
hold up under inspection — the background writer, lock-guarded `WithConnection`,
`auto_vacuum = INCREMENTAL` plus logical-size measurement, the severity index, and
the concrete-recorder DI wiring are all present and correct at this commit. The
module source is byte-identical between `39d737e` and current `HEAD`, so this review
reflects the live code. Three new findings were recorded, all low-to-medium and none
regressions of prior fixes. The most notable (SiteEventLogging-012) is a correctness
gap left by the SiteEventLogging-005 background-writer rework: when an event cannot
be persisted because the logger has been disposed, the returned `Task` is completed
*successfully* rather than faulted, so an `await`-ing caller is told a dropped audit
event was written. The other two are minor: unescaped SQL `LIKE` wildcards in the
keyword-search filter (SiteEventLogging-013) and the initial purge running
synchronously on the host startup thread (SiteEventLogging-014).
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). |
| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). Re-review: dropped events report success (-012); `LIKE` wildcards unescaped in keyword search (-013). |
| 2 | Akka.NET conventions | ☑ | Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). `Ask` boundary is appropriate. |
| 3 | Concurrency & thread safety | ☑ | Shared `SqliteConnection` used by purge/query without the write lock (-003). |
| 4 | Error handling & resilience | ☑ | `LogEventAsync` swallows write failures silently into a log line only (-008); purge catches broadly. |
| 5 | Security | ☑ | Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding. |
| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). |
| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). Re-review: initial purge blocks host startup thread (-014). |
| 7 | Design-document adherence | ☑ | Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002). |
| 8 | Code organization & conventions | ☑ | Concrete-type downcast of `ISiteEventLogger` (-007); `internal Connection` leaks DB handle (-007). |
| 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
@@ -529,3 +546,122 @@ explanatory note added to `AddSiteEventLogging` pointing readers to where the ac
is actually registered. Documentation/dead-code change only; no regression test was
added — the change is a method removal verified by the compiler (no callers) and the
full module suite still passing.
### SiteEventLogging-012 — Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:160-166,193-197` |
**Description**
`LogEventAsync` returns a `Task` that, per the interface XML doc (corrected under
SiteEventLogging-009), "completes once the event is durably persisted and faults if
the write fails, so callers that `await` it observe success or failure." Two paths
break that contract by signalling **success** for an event that was never written:
1. In `LogEventAsync`, if `_writeQueue.Writer.TryWrite(pending)` fails (the channel
has been completed because the logger was disposed), the code calls
`pending.Completion.TrySetResult()` — completing the `Task` successfully — even
though the comment immediately above acknowledges "there is nowhere to persist the
event."
2. In `ProcessWriteQueueAsync`, `WithConnection` returns `false` when the logger has
been disposed mid-drain. The code does not inspect the returned `written` flag and
unconditionally calls `pending.Completion.TrySetResult()`, again reporting success
for an event the comment admits "simply cannot be persisted."
The event log is the site's diagnostic audit trail. A caller that `await`s
`LogEventAsync` to confirm a critical event (deployment applied, alarm activated) was
recorded will observe a *successful* completion for an event that was silently
dropped. This is the same class of defect SiteEventLogging-008 fixed for write
*errors* — but the disposed-drop path was left reporting false success. The window
is the disposal/shutdown interval, during which shutdown-related events (graceful
singleton handover, instance disable) are exactly the events most likely to be
enqueued and lost.
**Recommendation**
For both paths, fault the `Task` (or complete it with a sentinel failure) instead of
`TrySetResult()` — e.g. `pending.Completion.TrySetException(new ObjectDisposedException(...))`
— so an `await`-ing caller can distinguish a dropped event from a persisted one.
Inspect the `written` flag returned by `WithConnection` in `ProcessWriteQueueAsync`
and only call `TrySetResult()` when `written` is `true`. Update the XML doc if
deliberate "drop silently on shutdown" semantics are chosen instead.
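The two paths above can be sketched as follows. This is a minimal model, not the module's code: `SketchEventLogger`, `PendingEvent`, `Persisted`, and the body of `WithConnection` are hypothetical stand-ins, and only the `TryWrite` / `written` / `TrySetResult()` shape is taken from the description. It assumes .NET 5 or later for the non-generic `TaskCompletionSource`.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical stand-in for the queued write item; the real type is not shown in this review.
public sealed class PendingEvent
{
    public PendingEvent(string message) => Message = message;

    public string Message { get; }

    // Asynchronous continuations keep awaiting callers off the writer loop's thread.
    public TaskCompletionSource Completion { get; } =
        new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
}

// Illustrative model only; names and structure are assumptions.
public sealed class SketchEventLogger : IDisposable
{
    private readonly Channel<PendingEvent> _writeQueue = Channel.CreateUnbounded<PendingEvent>();
    private readonly object _gate = new object();
    private bool _disposed;

    // Stand-in for the SQLite table so the sketch runs without a database.
    public List<string> Persisted { get; } = new List<string>();

    public Task LogEventAsync(string message)
    {
        var pending = new PendingEvent(message);
        if (!_writeQueue.Writer.TryWrite(pending))
        {
            // Path 1: the channel is completed, so the event is dropped. Fault the Task.
            pending.Completion.TrySetException(Dropped());
        }
        return pending.Completion.Task;
    }

    public async Task ProcessWriteQueueAsync()
    {
        await foreach (PendingEvent pending in _writeQueue.Reader.ReadAllAsync())
        {
            try
            {
                bool written = WithConnection(() => Persisted.Add(pending.Message));
                if (written)
                    pending.Completion.TrySetResult();
                else
                    pending.Completion.TrySetException(Dropped()); // Path 2: disposed mid-drain.
            }
            catch (Exception ex)
            {
                pending.Completion.TrySetException(ex); // Write failure is surfaced to the caller.
            }
        }
    }

    // Returns false, without running the action, once the logger is disposed.
    private bool WithConnection(Action action)
    {
        lock (_gate)
        {
            if (_disposed) return false;
            action();
            return true;
        }
    }

    public void Dispose()
    {
        lock (_gate) { _disposed = true; }
        _writeQueue.Writer.TryComplete(); // Later TryWrite calls now return false.
    }

    private static ObjectDisposedException Dropped() =>
        new ObjectDisposedException(nameof(SketchEventLogger), "Event dropped: logger disposed.");
}
```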
**Resolution**
_Unresolved._
### SiteEventLogging-013 — Keyword search does not escape SQL `LIKE` wildcards in user input
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:79-83` |
**Description**
The keyword-search filter builds the `LIKE` pattern as `$"%{request.KeywordFilter}%"`
and binds it as a parameter. Parameterisation correctly prevents SQL injection, but
it does **not** neutralise the `LIKE` metacharacters `%` and `_` inside the
user-supplied keyword. A search for a literal `_` (common in event sources and
identifiers such as `store_and_forward`, `PLC_1`, or instance IDs) is interpreted as
"match any single character", and a `%` matches any run of characters. The design
calls keyword search "free-text search on message and source fields ... Useful for
finding events by script name, alarm name, or error message" — users will reasonably
expect a literal substring match, so a query for `store_and_forward` silently returns
events containing `storeXandYforward` and similar false positives. There is no way
for the caller to search for a literal underscore or percent.
**Recommendation**
Escape `%`, `_`, and the escape character itself in `request.KeywordFilter` before
wrapping it in `%...%`, and append an `ESCAPE` clause to the `LIKE` expression
(e.g. `... LIKE $keyword ESCAPE '\'`). Alternatively, document that the keyword field
accepts `LIKE` wildcard syntax, but a literal-substring match is the behaviour the
design implies.
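A minimal sketch of the escaping step, assuming backslash as the escape character. `LikePattern` and its members are hypothetical helpers, and the `message` / `source` column names in the SQL fragment are guessed from the design text rather than read from `EventLogQueryService`. The escape character itself must be escaped too, otherwise a keyword ending in `\` would escape the trailing `%`.

```csharp
using System;
using System.Text;

// Hypothetical helper; not part of the reviewed module.
public static class LikePattern
{
    // Must match the character named in the SQL ESCAPE clause below.
    public const char EscapeChar = '\\';

    // Assumed query fragment: column names are illustrative, $keyword is the bound parameter.
    public const string KeywordClause =
        @"(message LIKE $keyword ESCAPE '\' OR source LIKE $keyword ESCAPE '\')";

    // Prefixes every LIKE metacharacter (and the escape character) with the escape character.
    public static string EscapeLiteral(string keyword)
    {
        if (keyword == null) throw new ArgumentNullException(nameof(keyword));

        var sb = new StringBuilder(keyword.Length + 8);
        foreach (char c in keyword)
        {
            if (c == '%' || c == '_' || c == EscapeChar)
                sb.Append(EscapeChar);
            sb.Append(c);
        }
        return sb.ToString();
    }

    // Builds the parameter value for a literal substring match.
    public static string Contains(string keyword) => "%" + EscapeLiteral(keyword) + "%";
}
```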
**Resolution**
_Unresolved._
### SiteEventLogging-014 — Initial purge runs synchronously on the host startup thread
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:34-48` |
**Description**
`EventLogPurgeService.ExecuteAsync` calls `RunPurge()` (a fully synchronous method
that runs `PurgeByRetention` and `PurgeByStorageCap`) *before* the first `await`
(`await timer.WaitForNextTickAsync(...)`). A `BackgroundService`'s `ExecuteAsync` is
invoked from `StartAsync`, and the host's startup pipeline does not proceed past a
`BackgroundService` until its `ExecuteAsync` yields at the first real `await`. Because
`RunPurge()` precedes any `await`, the entire initial purge — including a cap-purge
that deletes rows in 1000-row batches and runs `PRAGMA incremental_vacuum` until a
near-1 GB database is back under the cap — executes inline on the startup thread,
blocking host startup (and therefore the `/health/ready` gate) for as long as the
purge takes. On a site that has accumulated a large log this can be a multi-second
stall during every node start/failover. The class doc states the service "runs on a
background thread and does not block event recording" — the startup-thread block is
inconsistent with that intent.
**Recommendation**
Yield before the initial purge so it runs on the thread pool rather than the
startup thread — e.g. `await Task.Yield();` as the first statement of `ExecuteAsync`,
or move the initial `RunPurge()` to after the first `await timer.WaitForNextTickAsync`
(accepting a one-interval delay), or offload it with `await Task.Run(RunPurge, stoppingToken)`.
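The mechanism can be illustrated without the hosting packages. In this sketch the static methods are hypothetical stand-ins for `ExecuteAsync`, and `runPurge` stands in for `RunPurge()`. An `async` method runs synchronously on its caller's thread until the first `await` that actually suspends, so the position of the purge relative to that `await` decides which thread pays for it. The yielding variant assumes no `SynchronizationContext` is installed, in which case the continuation after `Task.Yield()` is queued to the thread pool.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative stand-ins for ExecuteAsync; not the service's actual code.
public static class StartupYieldSketch
{
    // Reported shape: the purge precedes the first await, so it runs on the caller's thread
    // and the caller does not get its Task back until the purge has finished.
    public static async Task ExecuteInlineAsync(Action runPurge, CancellationToken stoppingToken)
    {
        runPurge();
        await Task.Delay(TimeSpan.FromMilliseconds(10), stoppingToken); // stand-in for the timer wait
    }

    // Yield first: the caller receives an incomplete Task immediately,
    // and the purge runs in the continuation.
    public static async Task ExecuteYieldingAsync(Action runPurge, CancellationToken stoppingToken)
    {
        await Task.Yield();
        stoppingToken.ThrowIfCancellationRequested();
        runPurge();
    }

    // Explicit offload: the purge is queued to the thread pool via Task.Run.
    public static async Task ExecuteOffloadedAsync(Action runPurge, CancellationToken stoppingToken)
    {
        await Task.Run(runPurge, stoppingToken);
    }
}
```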
**Resolution**
_Unresolved._