Files
scadalink-design/code-reviews/SiteEventLogging/findings.md

709 lines
37 KiB
Markdown

# Code Review — SiteEventLogging
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.SiteEventLogging` |
| Design doc | `docs/requirements/Component-SiteEventLogging.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
## Summary
The SiteEventLogging module is small and broadly well-structured: a SQLite-backed
recorder (`SiteEventLogger`), a query service with keyset pagination, a background
purge service, and a thin Akka actor bridge. The query path is parameterised
correctly (no SQL injection) and reasonably well tested. However, the storage-cap
enforcement is functionally broken: `PRAGMA incremental_vacuum` is a no-op because
`auto_vacuum = INCREMENTAL` is never set, so the cap-purge loop never sees the
database shrink and over-deletes the entire table when triggered. There is also a
genuine concurrency hazard: the purge service and query service share the single
`SqliteConnection` owned by `SiteEventLogger` but bypass its `_writeLock`, so a purge
running on the background thread can collide with a write or a query on another
thread. The `LogEventAsync` API is synchronous despite its name and `Task` return,
which silently blocks Akka actor threads on disk I/O. Other findings concern the
cluster-singleton placement of the handler actor (which can pin to the standby
node), missing indexes for common query filters, retention/cap purge not enforcing
the requirement strictly, and several documentation/maintainability issues.
#### Re-review 2026-05-17 (commit `39d737e`)
Re-reviewed the module at commit `39d737e`. All eleven prior findings remain closed
(SiteEventLogging-001..003, 005..011 Resolved; 004 Won't Fix) and the resolutions
hold up under inspection — the background writer, lock-guarded `WithConnection`,
`auto_vacuum = INCREMENTAL` plus logical-size measurement, the severity index, and
the concrete-recorder DI wiring are all present and correct at this commit. The
module source is byte-identical between `39d737e` and current `HEAD`, so this review
reflects the live code. Three new findings were recorded, all low-to-medium and none
regressions of prior fixes. The most notable (SiteEventLogging-012) is a correctness
gap left by the SiteEventLogging-005 background-writer rework: when an event cannot
be persisted because the logger has been disposed, the returned `Task` is completed
*successfully* rather than faulted, so an `await`-ing caller is told a dropped audit
event was written. The other two are minor: unescaped SQL `LIKE` wildcards in the
keyword-search filter (SiteEventLogging-013) and a claimed initial-purge block on the
host startup thread (SiteEventLogging-014 — later re-triaged to Won't Fix, the
premise does not hold on .NET 8+).
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). Re-review: dropped events report success (-012); `LIKE` wildcards unescaped in keyword search (-013). |
| 2 | Akka.NET conventions | ☑ | Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). `Ask` boundary is appropriate. |
| 3 | Concurrency & thread safety | ☑ | Shared `SqliteConnection` used by purge/query without the write lock (-003). |
| 4 | Error handling & resilience | ☑ | `LogEventAsync` swallows write failures silently into a log line only (-008); purge catches broadly. |
| 5 | Security | ☑ | Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding. |
| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). Re-review: claimed initial-purge startup-thread block (-014 — Won't Fix, premise invalid on .NET 8+). |
| 7 | Design-document adherence | ☑ | Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002). |
| 8 | Code organization & conventions | ☑ | Concrete-type downcast of `ISiteEventLogger` (-007); `internal Connection` leaks DB handle (-007). |
| 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
| 10 | Documentation & comments | ☑ | `LogEventAsync` XML doc says "asynchronously" but is synchronous (-009); stale "Phase 4+" placeholder (-011). |
## Findings
### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:100-102`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:36-55` |
**Description**
`PurgeByStorageCap` issues `PRAGMA incremental_vacuum` after each delete batch to
reclaim space, then re-measures the database size via `page_count * page_size`.
`incremental_vacuum` only has any effect when the database was created with
`auto_vacuum = INCREMENTAL`. `InitializeSchema` never sets `auto_vacuum`, so the
database uses the SQLite default (`auto_vacuum = NONE`). With `NONE`,
`incremental_vacuum` is silently ignored and `page_count` does not decrease when
rows are deleted (free pages are retained in the file). Consequently the
`while (currentSizeBytes > capBytes)` loop never observes the size dropping. The
storage-cap feature required by the design ("configurable maximum database size...
oldest events are purged first") is therefore non-functional — it cannot bring the
file back under the cap.
**Recommendation**
Set `PRAGMA auto_vacuum = INCREMENTAL` in `InitializeSchema` before any tables are
created (it must be set before table creation or followed by a full `VACUUM` to take
effect on an existing database). Alternatively, run a full `VACUUM` after cap-purge
deletes, or measure logical data size (e.g. `page_count - freelist_count` times
`page_size`) instead of relying on `incremental_vacuum`.
**Resolution**
Resolved 2026-05-16 (commit `<pending>`): `InitializeSchema` now sets
`PRAGMA auto_vacuum = INCREMENTAL` before any table is created, and
`GetDatabaseSizeBytes` measures logical size as `(page_count - freelist_count) *
page_size` so reclaimed pages no longer mask the size drop. The cap-purge loop now
reliably observes the database shrinking. Regression test
`PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable`.
### SiteEventLogging-002 — Storage-cap purge deletes the entire table when space is not reclaimed
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:87-105` |
**Description**
Because of SiteEventLogging-001 the on-disk size never shrinks after a delete batch,
so `currentSizeBytes` stays above `capBytes`. The loop then keeps deleting 1000-row
batches on every iteration until `ExecuteNonQuery` returns 0 — i.e. until the table
is completely empty. The design states the cap should purge "the oldest events...
first" to stay within budget, not wipe the whole log. When the cap is hit (e.g.
during an alarm storm) this destroys all retained diagnostic history rather than
trimming it to the budget. The unit test `PurgeByStorageCap_DeletesOldestWhenOverCap`
masks the problem because it uses `MaxStorageMb = 0`, which legitimately expects an
empty table, so the over-delete behaviour is never exercised against a realistic cap.
**Recommendation**
Fix the size measurement / vacuum (SiteEventLogging-001) so the loop terminates when
the file is genuinely under the cap. Add a guard so the loop stops once
`currentSizeBytes` has stopped decreasing across iterations, and add a test with a
non-zero cap and a known oversized dataset to assert that only the oldest events are
removed.
**Resolution**
Resolved 2026-05-16 (commit `<pending>`): with the size measurement fixed
(SiteEventLogging-001) the cap loop terminates when the file is genuinely under the
cap. An additional guard stops the loop if the on-disk size fails to decrease across
an iteration, so a cap that can never be met no longer empties the whole table.
Regression tests `PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable` (asserts the
table is not emptied and the file ends under a realistic non-zero cap) and
`PurgeByStorageCap_RemovesOldestEventsFirst` (asserts only the oldest events are
removed).
### SiteEventLogging-003 — Shared `SqliteConnection` used by purge and query without the write lock
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:64,90,100,110,114`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:36`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34,72` |
**Description**
`SiteEventLogger` owns a single `SqliteConnection` and serialises its own writes via
`lock (_writeLock)`. `EventLogPurgeService` and `EventLogQueryService` both reach
into `_eventLogger.Connection` and execute commands directly, without acquiring
`_writeLock`. The purge runs on a `BackgroundService` thread (a different thread from
event-recording callers and from the actor that drives the query service). A single
`SqliteConnection` / `SqliteCommand` is not thread-safe; concurrent use from the
purge thread and a recording thread (or query thread) can throw
`SqliteException`/`InvalidOperationException` ("DataReader already open",
"connection busy") or corrupt command state. The purge `DELETE` and the recorder
`INSERT` racing is the most likely collision because event recording is continuous.
**Recommendation**
Funnel all access to the connection through a single synchronisation point: either
expose lock-guarded methods on `SiteEventLogger` for purge/query to call, or give the
purge and query services their own dedicated `SqliteConnection` instances (SQLite
supports multiple connections to the same file; `Cache=Shared` plus a `busy_timeout`
makes this safer). Do not share one `SqliteConnection` across threads.
**Resolution**
Resolved 2026-05-16 (commit `<pending>`): the raw `internal Connection` property was
removed from `SiteEventLogger` and replaced with lock-guarded `WithConnection(...)`
overloads that hold the existing `_writeLock` for the duration of the caller's
delegate. `EventLogPurgeService`, `EventLogQueryService`, and `LogEventAsync` all now
access the connection exclusively through `WithConnection`, so the purge thread,
query thread, and recording threads are serialised on a single lock. `Dispose` was
also brought under the lock to avoid a dispose/use race. Regression test
`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` exercises purge running
concurrently with multiple writer threads.
### SiteEventLogging-004 — Event-log handler runs as a cluster singleton that can land on the standby node
| | |
|--|--|
| Severity | Low |
| Original severity | High (re-triaged down to Low on 2026-05-16 — see Re-triage note) |
| Category | Design-document adherence |
| Status | Won't Fix |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:313-336`, `src/ScadaLink.SiteEventLogging/EventLogHandlerActor.cs:21-25` |
**Description**
`EventLogHandlerActor` is hosted as a `ClusterSingletonManager` singleton with the
stated intent that "queries always reach the active node". However, an Akka.NET
cluster singleton is pinned to the *oldest* member of the role, which is not the
same concept as the SCADA "active node" (the node currently running the Deployment
Manager singleton / serving live traffic). The design doc is explicit: "Only the
active node generates and stores events... the new active node starts logging to its
own SQLite database." The event-log SQLite file is node-local and unreplicated.
Nothing guarantees the event-log singleton co-locates with the active node, so a
remote query can be served by the standby node and read that node's near-empty
database, returning no events even though the active node has a full log. The
explanatory comment in `AkkaHostedService.cs` asserts the opposite of what actually
happens.
**Recommendation**
Either (a) host the query handler as a normal per-node actor and route queries to
the active node explicitly (the node owning the Deployment Manager singleton), or
(b) make the event-log writer follow the same singleton so the writer and the query
handler are guaranteed co-located. Reconcile the design doc and the inline comment
with whichever model is chosen.
**Re-triage note (2026-05-16)**
The finding's central claim — that a remote query "can be served by the standby
node and read that node's near-empty database" — is incorrect for the query path.
In `AkkaHostedService.cs` the `event-log-handler` `ClusterSingletonManager` and the
`deployment-manager` `ClusterSingletonManager` are created with the **same role**
(`siteRole`) in the **same cluster**. Akka.NET pins every cluster singleton of a
given role to the *oldest member of that role* — so all same-role singletons in a
cluster co-locate on one node. The "active node" in this codebase is, by definition,
the node hosting the `deployment-manager` singleton; the event-log query singleton
is therefore *guaranteed* to run on that same node. A `ClusterClient` query cannot
land on the standby. The inline comment in `AkkaHostedService.cs` is accurate, not
"the opposite of what happens".
A real but distinct concern exists: the *writer* (`SiteEventLogger`) is registered
as a plain per-node DI singleton (`AddSiteEventLogging`), so it records to a local
SQLite file on **every** node, including the standby. That wastes storage on the
standby but does **not** cause the query-returns-nothing symptom the finding
describes, because the query singleton always reads the *active* node's (populated)
database. Gating the writer to the active node would be a `ScadaLink.Host` wiring
change, outside this module's scope, and is a minor optimisation rather than a
correctness defect.
Re-triaged from High to Low and closed as **Won't Fix**: the High-severity
correctness claim does not hold. Any residual cleanup (gate the standby-node writer;
the comment needs no change) can be raised as a fresh Low finding against
`ScadaLink.Host` if desired.
**Resolution**
Won't Fix — 2026-05-16 (commit `<pending>`). Re-triaged: the asserted defect (query
served by standby returning an empty log) cannot occur because the event-log query
singleton and the deployment-manager singleton share a role and so always co-locate
on the active node. No code change made; see the re-triage note above.
### SiteEventLogging-005 — `LogEventAsync` performs synchronous disk I/O on the caller's thread
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57-99` |
**Description**
`LogEventAsync` is declared `async`-shaped (returns `Task`, `Async` suffix) but its
body is entirely synchronous: it takes `lock (_writeLock)`, runs
`cmd.ExecuteNonQuery()` (a blocking SQLite write), then returns `Task.CompletedTask`.
Callers across the codebase invoke it fire-and-forget as `_ = LogEventAsync(...)`
(e.g. `ScriptExecutionActor.cs:133`, `DataConnectionActor.cs:292`,
`ScriptActor.cs:250`) expecting it to be non-blocking. In reality the SQLite write,
and any contention on `_writeLock`, executes inline on the Akka actor thread of the
calling subsystem. Under an event burst (alarm storm, script failure loop) this
serialises actor threads on disk I/O and the global write lock, degrading the
hot-path subsystems the design intends to keep responsive.
**Recommendation**
Either make recording genuinely asynchronous (offload to a dedicated single-threaded
writer / `Channel<T>` consumer so callers truly fire-and-forget), or rename the
method to `LogEvent` and document that it blocks, so callers can decide. Given the
design's emphasis on not impacting runtime subsystems, an internal queue with a
background flush is preferable.
**Resolution**
Resolved 2026-05-16 (commit `pending`): event recording is now offloaded to a
dedicated background writer. `SiteEventLogger` owns an unbounded `Channel<T>` and a
single background consumer thread; `LogEventAsync` only validates its arguments and
enqueues, so caller threads (Akka actor threads on hot paths) never block on the
SQLite write or on contention for the write lock. The returned `Task` completes once
the event is durably persisted (so `await` callers still observe write ordering) and
faults if the write fails. `Dispose` completes the channel and drains the writer.
Regression test `LogEventAsync_DoesNotBlockCaller_WhenWriteIsSlow` (verifies the
caller returns in <500 ms while the database is held busy) plus
`LogEventAsync_TaskCompletes_AfterEventIsPersisted`.
### SiteEventLogging-006 — Missing indexes for severity and keyword-search query paths
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:50-52`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:65-81` |
**Description**
`InitializeSchema` creates indexes on `timestamp`, `event_type`, and `instance_id`.
The query service also filters on `severity` (`severity = $severity`) and performs
`message LIKE '%...%'` / `source LIKE '%...%'` keyword search. `severity` has no
index, and a leading-wildcard `LIKE` cannot use a normal index at all. With up to a
1 GB database and a 500-row page size, severity-filtered and keyword queries do full
table scans on every page. The design explicitly lists keyword search as a supported,
expected query type.
**Recommendation**
Add an index on `severity` (or a composite index aligned with common filter
combinations such as `(event_type, severity, id)`). For keyword search, consider an
FTS5 virtual table over `message` and `source`, or accept the scan but document the
cost.
**Resolution**
Resolved 2026-05-16 (commit `pending`): `InitializeSchema` now creates
`idx_events_severity` on `site_events(severity)`, so severity-filtered queries use an
index instead of full-scanning. The leading-wildcard `LIKE` keyword search cannot use
a B-tree index by construction; rather than adding an FTS5 table, that scan is
accepted and now documented with an inline comment next to the index DDL. Regression
tests `Schema_HasIndexOnSeverity` and `SeverityFilteredQuery_UsesIndex_NotFullScan`
(the latter asserts the SQLite query plan uses `idx_events_severity` and does not
`SCAN site_events`).
### SiteEventLogging-007 — `ISiteEventLogger` consumers downcast to the concrete type
| | |
|--|--|
| Severity | Medium (partially re-triaged 2026-05-16 — see Re-triage note) |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:21-30`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:20-28`, `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:10-23` |
**Description**
Both `EventLogPurgeService` and `EventLogQueryService` took `ISiteEventLogger` via
DI and immediately downcast it: `_eventLogger = (SiteEventLogger)eventLogger;`. This
made the registration fragile — any `ISiteEventLogger` that is not exactly
`SiteEventLogger` (a test double, a decorator) caused an `InvalidCastException` at
construction — and defeated the purpose of the interface abstraction.
**Re-triage note (2026-05-16)**
The finding as originally written also claimed the services "access the `internal
SqliteConnection Connection` property to run arbitrary SQL" and called itself "the
root cause of the unsynchronised connection sharing in SiteEventLogging-003". That
part is stale: the resolution of SiteEventLogging-003 had already removed the
`internal Connection` property and replaced it with lock-guarded `WithConnection`
overloads. At the time this finding was actioned, the only remaining defect was the
concrete-type downcast itself. Severity stays Medium; the description is corrected to
the downcast-only scope.
**Recommendation**
Have the purge and query services depend on the concrete `SiteEventLogger` directly
(it is the type that owns the lock-guarded `WithConnection`), and register the
concrete type in DI with the interface forwarded to the same singleton. Remove the
fragile downcasts.
**Resolution**
Resolved 2026-05-16 (commit `pending`): `AddSiteEventLogging` now registers
`SiteEventLogger` as a concrete singleton and forwards `ISiteEventLogger` to that
same instance. `EventLogPurgeService` and `EventLogQueryService` take a
`SiteEventLogger` constructor parameter directly, eliminating the
`(SiteEventLogger)eventLogger` downcast and its `InvalidCastException` risk. All
three services still share one connection/lock. Regression tests
`AddSiteEventLogging_ResolvesAllServices_SharingOneRecorderInstance` and
`PurgeAndQueryServices_AcceptConcreteRecorder_WithoutDowncast`.
### SiteEventLogging-008 — Event-recording write failures are silently swallowed
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:92-95` |
**Description**
If `ExecuteNonQuery` throws (disk full, database locked, file corruption), the
exception is caught, written to `ILogger`, and discarded; `LogEventAsync` still
returns `Task.CompletedTask` as if successful. Callers fire-and-forget the result so
they cannot detect failure. The event log is the site's diagnostic audit trail; a
sustained write failure (for example a locked-database storm caused by the
unsynchronised purge in SiteEventLogging-003) means events vanish with no signal to
operators except a local log line that nobody is watching. There is no failure
counter, no health-metric hook, and no retry.
**Recommendation**
Expose a failure signal: increment a counter that the Health Monitoring component
can surface (the design notes script/alarm error rates are derived from the event
log — a logging outage should be visible). At minimum, escalate repeated failures to
a Warning/Error health metric rather than only a local log line.
**Resolution**
Resolved 2026-05-16 (commit `pending`): write failures are no longer swallowed. The
background writer (introduced for SiteEventLogging-005) now, on an `INSERT` failure,
(a) increments a new `Interlocked`-guarded counter exposed as the public
`SiteEventLogger.FailedWriteCount` property — which Health Monitoring can poll to
detect a logging outage — and (b) faults the `Task` returned by `LogEventAsync` with
the exception instead of returning `Task.CompletedTask`. The error is still logged.
Regression test `LogEventAsync_FaultsTask_AndCountsFailure_OnWriteError`.
### SiteEventLogging-009 — XML doc on `LogEventAsync` claims asynchronous behaviour
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:8-10`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57` |
**Description**
The interface XML doc states "Record an event asynchronously." and the method is
named `LogEventAsync`, but the implementation is fully synchronous (see
SiteEventLogging-005). The documentation and naming are misleading: a reader will
reasonably assume the write is offloaded and the caller's thread is not blocked,
which is false. The `details` parameter doc says "Optional JSON details" but nothing
validates or requires JSON, so callers may pass arbitrary text.
**Re-triage note (2026-05-16)**
The finding's central premise is now stale. SiteEventLogging-005 was resolved by
introducing a dedicated background writer: `LogEventAsync` enqueues onto an unbounded
`Channel<T>` and returns without blocking, so the method is now *genuinely*
asynchronous and the name `LogEventAsync` is accurate. No rename is warranted. The
only residual documentation defect was the imprecise XML doc (the one-line summary
did not describe the enqueue/await semantics) and the `details` doc claiming "JSON"
when nothing enforces it. Scope corrected to those doc fixes only.
**Recommendation**
Align the name, signature, and documentation with the actual behaviour — either make
the method genuinely asynchronous or rename to `LogEvent` and correct the doc.
Clarify that `details` is free-form text unless JSON is actually enforced.
**Resolution**
Resolved 2026-05-16 (commit `pending`): the `ISiteEventLogger.LogEventAsync` XML doc
now describes the actual behaviour — the call enqueues onto a background writer and
returns without blocking, and the returned `Task` completes on durable persistence
(or faults on write failure). The `details` parameter doc was corrected to "Optional
free-form detail text ... JSON is conventional but not validated or enforced". No
rename was needed: post-SiteEventLogging-005 the method is genuinely asynchronous, so
`LogEventAsync` is accurate (see Re-triage note). Documentation-only change; no
regression test added (behaviour is already covered by the SiteEventLogging-005
tests `LogEventAsync_DoesNotBlockCaller_WhenWriteIsSlow` and
`LogEventAsync_TaskCompletes_AfterEventIsPersisted`).
### SiteEventLogging-010 — Test coverage gaps: actor bridge, purge/write concurrency, vacuum effectiveness, query error path
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ScadaLink.SiteEventLogging.Tests/` |
**Description**
The test suite covers recording, query filtering/pagination, and basic purge, but
several critical behaviours are untested:
- `EventLogHandlerActor` has no test — the actor message contract
(`EventLogQueryRequest` -> `EventLogQueryResponse`, `Sender.Tell`) is unverified.
- No test exercises purge running concurrently with active writes/queries, so the
connection-sharing race (SiteEventLogging-003) is invisible to CI.
- `PurgeByStorageCap` is only tested with `MaxStorageMb = 0`, which hides the
no-op-vacuum / over-delete bug (SiteEventLogging-001, -002). No test asserts the
file shrinks or that only oldest events are removed under a realistic cap.
- `EventLogQueryService.ExecuteQuery`'s catch block (`Success: false`,
`ErrorMessage`) has no test.
- `SiteEventLogger.Dispose` semantics (logging after dispose returns
`Task.CompletedTask`) and re-entrant dispose are untested.
**Recommendation**
Add tests for the actor bridge, a concurrency stress test (purge + write + query in
parallel), a realistic non-zero-cap purge test asserting size reduction and
oldest-first deletion, and a query-error-path test (e.g. corrupt/closed connection).
**Resolution**
Resolved 2026-05-16 (commit `pending`): the remaining coverage gaps are now filled.
`EventLogHandlerActorTests` (using `Akka.TestKit.Xunit2`) exercises the actor message
contract — `EventLogQueryRequest` -> `EventLogQueryResponse` via `Sender.Tell`, for
both success and error responses. `EventLogCoverageTests` covers the previously
untested `EventLogQueryService.ExecuteQuery` catch block
(`ExecuteQuery_ReturnsFailureResponse_WhenDatabaseUnavailable`) and the recorder's
disposed-state semantics (`LogEventAsync_AfterDispose_CompletesWithoutThrowing`,
`Dispose_IsIdempotent`). The purge/write concurrency, realistic-cap, and
oldest-first behaviours were already covered by the tests added when
SiteEventLogging-001/-002/-003 were resolved
(`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection`,
`PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable`,
`PurgeByStorageCap_RemovesOldestEventsFirst`).
### SiteEventLogging-011 — Stale "Phase 4+" placeholder in `ServiceCollectionExtensions`
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:18-22` |
**Description**
`AddSiteEventLoggingActors` is an empty method with a comment "Placeholder for Akka
actor registration (Phase 4+)". The actor (`EventLogHandlerActor`) is in fact already
implemented and is registered directly in `AkkaHostedService.cs:313-336`, not through
this method. The placeholder is dead code: it is either never called or called with
no effect, and the comment is stale. A reader looking for where the event-log actor
is wired up will be misdirected.
**Recommendation**
Either implement the actor registration here and have `AkkaHostedService` call it
(centralising the wiring), or delete `AddSiteEventLoggingActors` entirely and remove
the misleading comment.
**Resolution**
Resolved 2026-05-16 (commit `pending`): confirmed dead code — a repo-wide search
found zero callers of `AddSiteEventLoggingActors`, and `EventLogHandlerActor` is in
fact wired up directly in `ScadaLink.Host/Actors/AkkaHostedService.cs` as a cluster
singleton (it must be created inside the `ActorSystem` with a resolved
`IEventLogQueryService`, which a `IServiceCollection` extension cannot do). The empty
placeholder method and its stale "Phase 4+" comment were deleted, and a short
explanatory note added to `AddSiteEventLogging` pointing readers to where the actor
is actually registered. Documentation/dead-code change only; no regression test was
added — the change is a method removal verified by the compiler (no callers) and the
full module suite still passing.
### SiteEventLogging-012 — Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:160-166,193-197` |
**Description**
`LogEventAsync` returns a `Task` that, per the interface XML doc (corrected under
SiteEventLogging-009), "completes once the event is durably persisted and faults if
the write fails, so callers that `await` it observe success or failure." Two paths
break that contract by signalling **success** for an event that was never written:
1. In `LogEventAsync`, if `_writeQueue.Writer.TryWrite(pending)` fails (the channel
has been completed because the logger was disposed), the code calls
`pending.Completion.TrySetResult()` — completing the `Task` successfully — even
though the comment immediately above acknowledges "there is nowhere to persist the
event."
2. In `ProcessWriteQueueAsync`, `WithConnection` returns `false` when the logger has
been disposed mid-drain. The code does not inspect the returned `written` flag and
unconditionally calls `pending.Completion.TrySetResult()`, again reporting success
for an event the comment admits "simply cannot be persisted."
The event log is the site's diagnostic audit trail. A caller that `await`s
`LogEventAsync` to confirm a critical event (deployment applied, alarm activated) was
recorded will observe a *successful* completion for an event that was silently
dropped. This is the same class of defect SiteEventLogging-008 fixed for write
*errors* — but the disposed-drop path was left reporting false success. The window
is the disposal/shutdown interval, during which shutdown-related events (graceful
singleton handover, instance disable) are exactly the events most likely to be
enqueued and lost.
**Recommendation**
For both paths, fault the `Task` (or complete it with a sentinel failure) instead of
`TrySetResult()` — e.g. `pending.Completion.TrySetException(new ObjectDisposedException(...))`
— so an `await`-ing caller can distinguish a dropped event from a persisted one.
Inspect the `written` flag returned by `WithConnection` in `ProcessWriteQueueAsync`
and only call `TrySetResult()` when `written` is `true`. Update the XML doc if a
deliberate "drop silently on shutdown" semantics is chosen instead.
**Resolution**
Resolved 2026-05-17 (commit `<pending>`): both disposed-drop paths now fault the
returned `Task` with `ObjectDisposedException` instead of `TrySetResult()`
`LogEventAsync` on a failed `TryWrite`, and `ProcessWriteQueueAsync` now inspects the
`written` flag from `WithConnection` and only calls `TrySetResult()` when the row was
actually persisted. An `await`-ing caller can now distinguish a dropped audit event
from a persisted one. Regression tests
`LogEventAsync_AfterDispose_FaultsTask_NotReportsSuccess` and
`LogEventAsync_EnqueuedThenDisposed_FaultsTask_WhenWriteCannotComplete` (the prior
`LogEventAsync_AfterDispose_CompletesWithoutThrowing` test asserted the now-incorrect
silent-success behaviour and was replaced).
### SiteEventLogging-013 — Keyword search does not escape SQL `LIKE` wildcards in user input
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:79-83` |
**Description**
The keyword-search filter builds the `LIKE` pattern as `$"%{request.KeywordFilter}%"`
and binds it as a parameter. Parameterisation correctly prevents SQL injection, but
it does **not** neutralise the `LIKE` metacharacters `%` and `_` inside the
user-supplied keyword. A search for a literal `_` (common in event sources and
identifiers such as `store_and_forward`, `PLC_1`, or instance IDs) is interpreted as
"match any single character", and a `%` matches any run of characters. The design
calls keyword search "free-text search on message and source fields ... Useful for
finding events by script name, alarm name, or error message" — users will reasonably
expect a literal substring match, so a query for `store_and_forward` silently returns
events containing `storeXandYforward` and similar false positives. There is no way
for the caller to search for a literal underscore or percent.
**Recommendation**
Escape `%`, `_`, and the escape character itself in `request.KeywordFilter` before
wrapping it in `%...%`, and append an `ESCAPE` clause to the `LIKE` expression
(e.g. `... LIKE $keyword ESCAPE '\'`). Alternatively document that the keyword field
accepts `LIKE` wildcard syntax, but a literal-substring match is the behaviour the
design implies.
**Resolution**
Resolved 2026-05-17 (commit `<pending>`): `EventLogQueryService` now escapes `\`,
`%`, and `_` in `request.KeywordFilter` via a new `EscapeLikePattern` helper before
wrapping it in `%...%`, and the `LIKE` clauses carry an explicit `ESCAPE '\'` so the
keyword is matched as a literal substring as the design intends. Regression tests
`Query_KeywordSearch_TreatsUnderscoreAsLiteral_NotWildcard`,
`Query_KeywordSearch_TreatsPercentAsLiteral_NotWildcard`, and
`Query_KeywordSearch_StillMatchesPlainSubstring`.
### SiteEventLogging-014 — Initial purge runs synchronously on the host startup thread
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Won't Fix |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:34-48` |
**Description**
`EventLogPurgeService.ExecuteAsync` calls `RunPurge()` (a fully synchronous method
that runs `PurgeByRetention` and `PurgeByStorageCap`) *before* the first `await`
(`await timer.WaitForNextTickAsync(...)`). A `BackgroundService`'s `ExecuteAsync` is
invoked from `StartAsync`, and the host's startup pipeline does not proceed past a
`BackgroundService` until its `ExecuteAsync` yields at the first real `await`. Because
`RunPurge()` precedes any `await`, the entire initial purge — including a cap-purge
that deletes rows in 1000-row batches and runs `PRAGMA incremental_vacuum` until a
near-1 GB database is back under the cap — executes inline on the startup thread,
blocking host startup (and therefore the `/health/ready` gate) for as long as the
purge takes. On a site that has accumulated a large log this can be a multi-second
stall during every node start/failover. The class doc states the service "runs on a
background thread and does not block event recording" — the startup-thread block is
inconsistent with that intent.
**Recommendation**
Yield before the initial purge so it runs on the background scheduler rather than the
startup thread — e.g. `await Task.Yield();` as the first statement of `ExecuteAsync`,
or move the initial `RunPurge()` to after the first `await timer.WaitForNextTickAsync`
(accepting a one-interval delay), or offload it with `await Task.Run(RunPurge, stoppingToken)`.
**Re-triage note (2026-05-17)**
The finding's central premise — "the host's startup pipeline does not proceed past a
`BackgroundService` until its `ExecuteAsync` yields at the first real `await`", so the
synchronous `RunPurge()` prelude blocks the startup thread — is **incorrect for the
.NET version this project targets**. The module targets `net10.0`. Since .NET 8,
`BackgroundService.StartAsync` no longer runs `ExecuteAsync` inline on the calling
(host startup) thread: `ExecuteAsync` is dispatched onto the thread pool, and
`StartAsync` returns immediately regardless of how long the synchronous prelude runs.
This was verified empirically. A standalone `BackgroundService` whose `ExecuteAsync`
prelude does a 1.5 s synchronous `Thread.Sleep` before its first `await` showed
`StartAsync` returning in **0 ms**, with the prelude running on a *different* thread
pool thread than the caller. Applied to `EventLogPurgeService`, `StartAsync` returns
promptly and the initial `RunPurge()` executes on the background scheduler — host
startup and the `/health/ready` gate are not blocked. There is therefore no defect
to fix.
**Resolution**
Won't Fix — 2026-05-17 (commit `<pending>`). Re-triaged: the asserted startup-thread
block cannot occur on .NET 8+ (this module targets `net10.0`), where
`BackgroundService` dispatches `ExecuteAsync` to the thread pool rather than running
its synchronous prelude on the host startup thread — verified empirically (see the
re-triage note). No code change made. A verification test
`StartAsync_DoesNotBlock_OnTheInitialPurge` was added to pin this behaviour
(asserts `StartAsync` returns in under 1 s and the initial purge still runs on the
background scheduler).