# Code Review — SiteEventLogging | Field | Value | |-------|-------| | Module | `src/ScadaLink.SiteEventLogging` | | Design doc | `docs/requirements/Component-SiteEventLogging.md` | | Status | Reviewed | | Last reviewed | 2026-05-17 | | Reviewer | claude-agent | | Commit reviewed | `39d737e` | | Open findings | 3 | ## Summary The SiteEventLogging module is small and broadly well-structured: a SQLite-backed recorder (`SiteEventLogger`), a query service with keyset pagination, a background purge service, and a thin Akka actor bridge. The query path is parameterised correctly (no SQL injection) and reasonably well tested. However, the storage-cap enforcement is functionally broken: `PRAGMA incremental_vacuum` is a no-op because `auto_vacuum = INCREMENTAL` is never set, so the cap-purge loop never sees the database shrink and over-deletes the entire table when triggered. There is also a genuine concurrency hazard: the purge service and query service share the single `SqliteConnection` owned by `SiteEventLogger` but bypass its `_writeLock`, so a purge running on the background thread can collide with a write or a query on another thread. The `LogEventAsync` API is synchronous despite its name and `Task` return, which silently blocks Akka actor threads on disk I/O. Other findings concern the cluster-singleton placement of the handler actor (which can pin to the standby node), missing indexes for common query filters, retention/cap purge not enforcing the requirement strictly, and several documentation/maintainability issues. #### Re-review 2026-05-17 (commit `39d737e`) Re-reviewed the module at commit `39d737e`. All eleven prior findings remain closed (SiteEventLogging-001..003, 005..011 Resolved; 004 Won't Fix) and the resolutions hold up under inspection — the background writer, lock-guarded `WithConnection`, `auto_vacuum = INCREMENTAL` plus logical-size measurement, the severity index, and the concrete-recorder DI wiring are all present and correct at this commit. The module source is byte-identical between `39d737e` and current `HEAD`, so this review reflects the live code. Three new findings were recorded, all low-to-medium and none regressions of prior fixes. The most notable (SiteEventLogging-012) is a correctness gap left by the SiteEventLogging-005 background-writer rework: when an event cannot be persisted because the logger has been disposed, the returned `Task` is completed *successfully* rather than faulted, so an `await`-ing caller is told a dropped audit event was written. The other two are minor: unescaped SQL `LIKE` wildcards in the keyword-search filter (SiteEventLogging-013) and the initial purge running synchronously on the host startup thread (SiteEventLogging-014). ## Checklist coverage | # | Category | Examined | Notes | |---|----------|----------|-------| | 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). Re-review: dropped events report success (-012); `LIKE` wildcards unescaped in keyword search (-013). | | 2 | Akka.NET conventions | ☑ | Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). `Ask` boundary is appropriate. | | 3 | Concurrency & thread safety | ☑ | Shared `SqliteConnection` used by purge/query without the write lock (-003). | | 4 | Error handling & resilience | ☑ | `LogEventAsync` swallows write failures silently into a log line only (-008); purge catches broadly. | | 5 | Security | ☑ | Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding. | | 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). Re-review: initial purge blocks host startup thread (-014). | | 7 | Design-document adherence | ☑ | Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002). | | 8 | Code organization & conventions | ☑ | Concrete-type downcast of `ISiteEventLogger` (-007); `internal Connection` leaks DB handle (-007). | | 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). | | 10 | Documentation & comments | ☑ | `LogEventAsync` XML doc says "asynchronously" but is synchronous (-009); stale "Phase 4+" placeholder (-011). | ## Findings ### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space | | | |--|--| | Severity | High | | Category | Correctness & logic bugs | | Status | Resolved | | Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:100-102`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:36-55` | **Description** `PurgeByStorageCap` issues `PRAGMA incremental_vacuum` after each delete batch to reclaim space, then re-measures the database size via `page_count * page_size`. `incremental_vacuum` only has any effect when the database was created with `auto_vacuum = INCREMENTAL`. `InitializeSchema` never sets `auto_vacuum`, so the database uses the SQLite default (`auto_vacuum = NONE`). With `NONE`, `incremental_vacuum` is silently ignored and `page_count` does not decrease when rows are deleted (free pages are retained in the file). Consequently the `while (currentSizeBytes > capBytes)` loop never observes the size dropping. The storage-cap feature required by the design ("configurable maximum database size... oldest events are purged first") is therefore non-functional — it cannot bring the file back under the cap. **Recommendation** Set `PRAGMA auto_vacuum = INCREMENTAL` in `InitializeSchema` before any tables are created (it must be set before table creation or followed by a full `VACUUM` to take effect on an existing database). Alternatively, run a full `VACUUM` after cap-purge deletes, or measure logical data size (e.g. `page_count - freelist_count` times `page_size`) instead of relying on `incremental_vacuum`. **Resolution** Resolved 2026-05-16 (commit ``): `InitializeSchema` now sets `PRAGMA auto_vacuum = INCREMENTAL` before any table is created, and `GetDatabaseSizeBytes` measures logical size as `(page_count - freelist_count) * page_size` so reclaimed pages no longer mask the size drop. The cap-purge loop now reliably observes the database shrinking. Regression test `PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable`. ### SiteEventLogging-002 — Storage-cap purge deletes the entire table when space is not reclaimed | | | |--|--| | Severity | High | | Category | Design-document adherence | | Status | Resolved | | Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:87-105` | **Description** Because of SiteEventLogging-001 the on-disk size never shrinks after a delete batch, so `currentSizeBytes` stays above `capBytes`. The loop then keeps deleting 1000-row batches on every iteration until `ExecuteNonQuery` returns 0 — i.e. until the table is completely empty. The design states the cap should purge "the oldest events... first" to stay within budget, not wipe the whole log. When the cap is hit (e.g. during an alarm storm) this destroys all retained diagnostic history rather than trimming it to the budget. The unit test `PurgeByStorageCap_DeletesOldestWhenOverCap` masks the problem because it uses `MaxStorageMb = 0`, which legitimately expects an empty table, so the over-delete behaviour is never exercised against a realistic cap. **Recommendation** Fix the size measurement / vacuum (SiteEventLogging-001) so the loop terminates when the file is genuinely under the cap. Add a guard so the loop stops once `currentSizeBytes` has stopped decreasing across iterations, and add a test with a non-zero cap and a known oversized dataset to assert that only the oldest events are removed. **Resolution** Resolved 2026-05-16 (commit ``): with the size measurement fixed (SiteEventLogging-001) the cap loop terminates when the file is genuinely under the cap. An additional guard stops the loop if the on-disk size fails to decrease across an iteration, so a cap that can never be met no longer empties the whole table. Regression tests `PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable` (asserts the table is not emptied and the file ends under a realistic non-zero cap) and `PurgeByStorageCap_RemovesOldestEventsFirst` (asserts only the oldest events are removed). ### SiteEventLogging-003 — Shared `SqliteConnection` used by purge and query without the write lock | | | |--|--| | Severity | High | | Category | Concurrency & thread safety | | Status | Resolved | | Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:64,90,100,110,114`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:36`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34,72` | **Description** `SiteEventLogger` owns a single `SqliteConnection` and serialises its own writes via `lock (_writeLock)`. `EventLogPurgeService` and `EventLogQueryService` both reach into `_eventLogger.Connection` and execute commands directly, without acquiring `_writeLock`. The purge runs on a `BackgroundService` thread (a different thread from event-recording callers and from the actor that drives the query service). A single `SqliteConnection` / `SqliteCommand` is not thread-safe; concurrent use from the purge thread and a recording thread (or query thread) can throw `SqliteException`/`InvalidOperationException` ("DataReader already open", "connection busy") or corrupt command state. The purge `DELETE` and the recorder `INSERT` racing is the most likely collision because event recording is continuous. **Recommendation** Funnel all access to the connection through a single synchronisation point: either expose lock-guarded methods on `SiteEventLogger` for purge/query to call, or give the purge and query services their own dedicated `SqliteConnection` instances (SQLite supports multiple connections to the same file; `Cache=Shared` plus a `busy_timeout` makes this safer). Do not share one `SqliteConnection` across threads. **Resolution** Resolved 2026-05-16 (commit ``): the raw `internal Connection` property was removed from `SiteEventLogger` and replaced with lock-guarded `WithConnection(...)` overloads that hold the existing `_writeLock` for the duration of the caller's delegate. `EventLogPurgeService`, `EventLogQueryService`, and `LogEventAsync` all now access the connection exclusively through `WithConnection`, so the purge thread, query thread, and recording threads are serialised on a single lock. `Dispose` was also brought under the lock to avoid a dispose/use race. Regression test `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` exercises purge running concurrently with multiple writer threads. ### SiteEventLogging-004 — Event-log handler runs as a cluster singleton that can land on the standby node | | | |--|--| | Severity | Low | | Original severity | High (re-triaged down to Low on 2026-05-16 — see Re-triage note) | | Category | Design-document adherence | | Status | Won't Fix | | Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:313-336`, `src/ScadaLink.SiteEventLogging/EventLogHandlerActor.cs:21-25` | **Description** `EventLogHandlerActor` is hosted as a `ClusterSingletonManager` singleton with the stated intent that "queries always reach the active node". However, an Akka.NET cluster singleton is pinned to the *oldest* member of the role, which is not the same concept as the SCADA "active node" (the node currently running the Deployment Manager singleton / serving live traffic). The design doc is explicit: "Only the active node generates and stores events... the new active node starts logging to its own SQLite database." The event-log SQLite file is node-local and unreplicated. Nothing guarantees the event-log singleton co-locates with the active node, so a remote query can be served by the standby node and read that node's near-empty database, returning no events even though the active node has a full log. The explanatory comment in `AkkaHostedService.cs` asserts the opposite of what actually happens. **Recommendation** Either (a) host the query handler as a normal per-node actor and route queries to the active node explicitly (the node owning the Deployment Manager singleton), or (b) make the event-log writer follow the same singleton so the writer and the query handler are guaranteed co-located. Reconcile the design doc and the inline comment with whichever model is chosen. **Re-triage note (2026-05-16)** The finding's central claim — that a remote query "can be served by the standby node and read that node's near-empty database" — is incorrect for the query path. In `AkkaHostedService.cs` the `event-log-handler` `ClusterSingletonManager` and the `deployment-manager` `ClusterSingletonManager` are created with the **same role** (`siteRole`) in the **same cluster**. Akka.NET pins every cluster singleton of a given role to the *oldest member of that role* — so all same-role singletons in a cluster co-locate on one node. The "active node" in this codebase is, by definition, the node hosting the `deployment-manager` singleton; the event-log query singleton is therefore *guaranteed* to run on that same node. A `ClusterClient` query cannot land on the standby. The inline comment in `AkkaHostedService.cs` is accurate, not "the opposite of what happens". A real but distinct concern exists: the *writer* (`SiteEventLogger`) is registered as a plain per-node DI singleton (`AddSiteEventLogging`), so it records to a local SQLite file on **every** node, including the standby. That wastes storage on the standby but does **not** cause the query-returns-nothing symptom the finding describes, because the query singleton always reads the *active* node's (populated) database. Gating the writer to the active node would be a `ScadaLink.Host` wiring change, outside this module's scope, and is a minor optimisation rather than a correctness defect. Re-triaged from High to Low and closed as **Won't Fix**: the High-severity correctness claim does not hold. Any residual cleanup (gate the standby-node writer; the comment needs no change) can be raised as a fresh Low finding against `ScadaLink.Host` if desired. **Resolution** Won't Fix — 2026-05-16 (commit ``). Re-triaged: the asserted defect (query served by standby returning an empty log) cannot occur because the event-log query singleton and the deployment-manager singleton share a role and so always co-locate on the active node. No code change made; see the re-triage note above. ### SiteEventLogging-005 — `LogEventAsync` performs synchronous disk I/O on the caller's thread | | | |--|--| | Severity | Medium | | Category | Performance & resource management | | Status | Resolved | | Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57-99` | **Description** `LogEventAsync` is declared `async`-shaped (returns `Task`, `Async` suffix) but its body is entirely synchronous: it takes `lock (_writeLock)`, runs `cmd.ExecuteNonQuery()` (a blocking SQLite write), then returns `Task.CompletedTask`. Callers across the codebase invoke it fire-and-forget as `_ = LogEventAsync(...)` (e.g. `ScriptExecutionActor.cs:133`, `DataConnectionActor.cs:292`, `ScriptActor.cs:250`) expecting it to be non-blocking. In reality the SQLite write, and any contention on `_writeLock`, executes inline on the Akka actor thread of the calling subsystem. Under an event burst (alarm storm, script failure loop) this serialises actor threads on disk I/O and the global write lock, degrading the hot-path subsystems the design intends to keep responsive. **Recommendation** Either make recording genuinely asynchronous (offload to a dedicated single-threaded writer / `Channel` consumer so callers truly fire-and-forget), or rename the method to `LogEvent` and document that it blocks, so callers can decide. Given the design's emphasis on not impacting runtime subsystems, an internal queue with a background flush is preferable. **Resolution** Resolved 2026-05-16 (commit `pending`): event recording is now offloaded to a dedicated background writer. `SiteEventLogger` owns an unbounded `Channel` and a single background consumer thread; `LogEventAsync` only validates its arguments and enqueues, so caller threads (Akka actor threads on hot paths) never block on the SQLite write or on contention for the write lock. The returned `Task` completes once the event is durably persisted (so `await` callers still observe write ordering) and faults if the write fails. `Dispose` completes the channel and drains the writer. Regression test `LogEventAsync_DoesNotBlockCaller_WhenWriteIsSlow` (verifies the caller returns in <500 ms while the database is held busy) plus `LogEventAsync_TaskCompletes_AfterEventIsPersisted`. ### SiteEventLogging-006 — Missing indexes for severity and keyword-search query paths | | | |--|--| | Severity | Low | | Category | Performance & resource management | | Status | Resolved | | Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:50-52`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:65-81` | **Description** `InitializeSchema` creates indexes on `timestamp`, `event_type`, and `instance_id`. The query service also filters on `severity` (`severity = $severity`) and performs `message LIKE '%...%'` / `source LIKE '%...%'` keyword search. `severity` has no index, and a leading-wildcard `LIKE` cannot use a normal index at all. With up to a 1 GB database and a 500-row page size, severity-filtered and keyword queries do full table scans on every page. The design explicitly lists keyword search as a supported, expected query type. **Recommendation** Add an index on `severity` (or a composite index aligned with common filter combinations such as `(event_type, severity, id)`). For keyword search, consider an FTS5 virtual table over `message` and `source`, or accept the scan but document the cost. **Resolution** Resolved 2026-05-16 (commit `pending`): `InitializeSchema` now creates `idx_events_severity` on `site_events(severity)`, so severity-filtered queries use an index instead of full-scanning. The leading-wildcard `LIKE` keyword search cannot use a B-tree index by construction; rather than adding an FTS5 table, that scan is accepted and now documented with an inline comment next to the index DDL. Regression tests `Schema_HasIndexOnSeverity` and `SeverityFilteredQuery_UsesIndex_NotFullScan` (the latter asserts the SQLite query plan uses `idx_events_severity` and does not `SCAN site_events`). ### SiteEventLogging-007 — `ISiteEventLogger` consumers downcast to the concrete type | | | |--|--| | Severity | Medium (partially re-triaged 2026-05-16 — see Re-triage note) | | Category | Code organization & conventions | | Status | Resolved | | Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:21-30`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:20-28`, `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:10-23` | **Description** Both `EventLogPurgeService` and `EventLogQueryService` took `ISiteEventLogger` via DI and immediately downcast it: `_eventLogger = (SiteEventLogger)eventLogger;`. This made the registration fragile — any `ISiteEventLogger` that is not exactly `SiteEventLogger` (a test double, a decorator) caused an `InvalidCastException` at construction — and defeated the purpose of the interface abstraction. **Re-triage note (2026-05-16)** The finding as originally written also claimed the services "access the `internal SqliteConnection Connection` property to run arbitrary SQL" and called itself "the root cause of the unsynchronised connection sharing in SiteEventLogging-003". That part is stale: the resolution of SiteEventLogging-003 had already removed the `internal Connection` property and replaced it with lock-guarded `WithConnection` overloads. At the time this finding was actioned, the only remaining defect was the concrete-type downcast itself. Severity stays Medium; the description is corrected to the downcast-only scope. **Recommendation** Have the purge and query services depend on the concrete `SiteEventLogger` directly (it is the type that owns the lock-guarded `WithConnection`), and register the concrete type in DI with the interface forwarded to the same singleton. Remove the fragile downcasts. **Resolution** Resolved 2026-05-16 (commit `pending`): `AddSiteEventLogging` now registers `SiteEventLogger` as a concrete singleton and forwards `ISiteEventLogger` to that same instance. `EventLogPurgeService` and `EventLogQueryService` take a `SiteEventLogger` constructor parameter directly, eliminating the `(SiteEventLogger)eventLogger` downcast and its `InvalidCastException` risk. All three services still share one connection/lock. Regression tests `AddSiteEventLogging_ResolvesAllServices_SharingOneRecorderInstance` and `PurgeAndQueryServices_AcceptConcreteRecorder_WithoutDowncast`. ### SiteEventLogging-008 — Event-recording write failures are silently swallowed | | | |--|--| | Severity | Medium | | Category | Error handling & resilience | | Status | Resolved | | Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:92-95` | **Description** If `ExecuteNonQuery` throws (disk full, database locked, file corruption), the exception is caught, written to `ILogger`, and discarded; `LogEventAsync` still returns `Task.CompletedTask` as if successful. Callers fire-and-forget the result so they cannot detect failure. The event log is the site's diagnostic audit trail; a sustained write failure (for example a locked-database storm caused by the unsynchronised purge in SiteEventLogging-003) means events vanish with no signal to operators except a local log line that nobody is watching. There is no failure counter, no health-metric hook, and no retry. **Recommendation** Expose a failure signal: increment a counter that the Health Monitoring component can surface (the design notes script/alarm error rates are derived from the event log — a logging outage should be visible). At minimum, escalate repeated failures to a Warning/Error health metric rather than only a local log line. **Resolution** Resolved 2026-05-16 (commit `pending`): write failures are no longer swallowed. The background writer (introduced for SiteEventLogging-005) now, on an `INSERT` failure, (a) increments a new `Interlocked`-guarded counter exposed as the public `SiteEventLogger.FailedWriteCount` property — which Health Monitoring can poll to detect a logging outage — and (b) faults the `Task` returned by `LogEventAsync` with the exception instead of returning `Task.CompletedTask`. The error is still logged. Regression test `LogEventAsync_FaultsTask_AndCountsFailure_OnWriteError`. ### SiteEventLogging-009 — XML doc on `LogEventAsync` claims asynchronous behaviour | | | |--|--| | Severity | Low | | Category | Documentation & comments | | Status | Resolved | | Location | `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:8-10`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57` | **Description** The interface XML doc states "Record an event asynchronously." and the method is named `LogEventAsync`, but the implementation is fully synchronous (see SiteEventLogging-005). The documentation and naming are misleading: a reader will reasonably assume the write is offloaded and the caller's thread is not blocked, which is false. The `details` parameter doc says "Optional JSON details" but nothing validates or requires JSON, so callers may pass arbitrary text. **Re-triage note (2026-05-16)** The finding's central premise is now stale. SiteEventLogging-005 was resolved by introducing a dedicated background writer: `LogEventAsync` enqueues onto an unbounded `Channel` and returns without blocking, so the method is now *genuinely* asynchronous and the name `LogEventAsync` is accurate. No rename is warranted. The only residual documentation defect was the imprecise XML doc (the one-line summary did not describe the enqueue/await semantics) and the `details` doc claiming "JSON" when nothing enforces it. Scope corrected to those doc fixes only. **Recommendation** Align the name, signature, and documentation with the actual behaviour — either make the method genuinely asynchronous or rename to `LogEvent` and correct the doc. Clarify that `details` is free-form text unless JSON is actually enforced. **Resolution** Resolved 2026-05-16 (commit `pending`): the `ISiteEventLogger.LogEventAsync` XML doc now describes the actual behaviour — the call enqueues onto a background writer and returns without blocking, and the returned `Task` completes on durable persistence (or faults on write failure). The `details` parameter doc was corrected to "Optional free-form detail text ... JSON is conventional but not validated or enforced". No rename was needed: post-SiteEventLogging-005 the method is genuinely asynchronous, so `LogEventAsync` is accurate (see Re-triage note). Documentation-only change; no regression test added (behaviour is already covered by the SiteEventLogging-005 tests `LogEventAsync_DoesNotBlockCaller_WhenWriteIsSlow` and `LogEventAsync_TaskCompletes_AfterEventIsPersisted`). ### SiteEventLogging-010 — Test coverage gaps: actor bridge, purge/write concurrency, vacuum effectiveness, query error path | | | |--|--| | Severity | Medium | | Category | Testing coverage | | Status | Resolved | | Location | `tests/ScadaLink.SiteEventLogging.Tests/` | **Description** The test suite covers recording, query filtering/pagination, and basic purge, but several critical behaviours are untested: - `EventLogHandlerActor` has no test — the actor message contract (`EventLogQueryRequest` -> `EventLogQueryResponse`, `Sender.Tell`) is unverified. - No test exercises purge running concurrently with active writes/queries, so the connection-sharing race (SiteEventLogging-003) is invisible to CI. - `PurgeByStorageCap` is only tested with `MaxStorageMb = 0`, which hides the no-op-vacuum / over-delete bug (SiteEventLogging-001, -002). No test asserts the file shrinks or that only oldest events are removed under a realistic cap. - `EventLogQueryService.ExecuteQuery`'s catch block (`Success: false`, `ErrorMessage`) has no test. - `SiteEventLogger.Dispose` semantics (logging after dispose returns `Task.CompletedTask`) and re-entrant dispose are untested. **Recommendation** Add tests for the actor bridge, a concurrency stress test (purge + write + query in parallel), a realistic non-zero-cap purge test asserting size reduction and oldest-first deletion, and a query-error-path test (e.g. corrupt/closed connection). **Resolution** Resolved 2026-05-16 (commit `pending`): the remaining coverage gaps are now filled. `EventLogHandlerActorTests` (using `Akka.TestKit.Xunit2`) exercises the actor message contract — `EventLogQueryRequest` -> `EventLogQueryResponse` via `Sender.Tell`, for both success and error responses. `EventLogCoverageTests` covers the previously untested `EventLogQueryService.ExecuteQuery` catch block (`ExecuteQuery_ReturnsFailureResponse_WhenDatabaseUnavailable`) and the recorder's disposed-state semantics (`LogEventAsync_AfterDispose_CompletesWithoutThrowing`, `Dispose_IsIdempotent`). The purge/write concurrency, realistic-cap, and oldest-first behaviours were already covered by the tests added when SiteEventLogging-001/-002/-003 were resolved (`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection`, `PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable`, `PurgeByStorageCap_RemovesOldestEventsFirst`). ### SiteEventLogging-011 — Stale "Phase 4+" placeholder in `ServiceCollectionExtensions` | | | |--|--| | Severity | Low | | Category | Documentation & comments | | Status | Resolved | | Location | `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:18-22` | **Description** `AddSiteEventLoggingActors` is an empty method with a comment "Placeholder for Akka actor registration (Phase 4+)". The actor (`EventLogHandlerActor`) is in fact already implemented and is registered directly in `AkkaHostedService.cs:313-336`, not through this method. The placeholder is dead code: it is either never called or called with no effect, and the comment is stale. A reader looking for where the event-log actor is wired up will be misdirected. **Recommendation** Either implement the actor registration here and have `AkkaHostedService` call it (centralising the wiring), or delete `AddSiteEventLoggingActors` entirely and remove the misleading comment. **Resolution** Resolved 2026-05-16 (commit `pending`): confirmed dead code — a repo-wide search found zero callers of `AddSiteEventLoggingActors`, and `EventLogHandlerActor` is in fact wired up directly in `ScadaLink.Host/Actors/AkkaHostedService.cs` as a cluster singleton (it must be created inside the `ActorSystem` with a resolved `IEventLogQueryService`, which a `IServiceCollection` extension cannot do). The empty placeholder method and its stale "Phase 4+" comment were deleted, and a short explanatory note added to `AddSiteEventLogging` pointing readers to where the actor is actually registered. Documentation/dead-code change only; no regression test was added — the change is a method removal verified by the compiler (no callers) and the full module suite still passing. ### SiteEventLogging-012 — Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted | | | |--|--| | Severity | Medium | | Category | Correctness & logic bugs | | Status | Open | | Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:160-166,193-197` | **Description** `LogEventAsync` returns a `Task` that, per the interface XML doc (corrected under SiteEventLogging-009), "completes once the event is durably persisted and faults if the write fails, so callers that `await` it observe success or failure." Two paths break that contract by signalling **success** for an event that was never written: 1. In `LogEventAsync`, if `_writeQueue.Writer.TryWrite(pending)` fails (the channel has been completed because the logger was disposed), the code calls `pending.Completion.TrySetResult()` — completing the `Task` successfully — even though the comment immediately above acknowledges "there is nowhere to persist the event." 2. In `ProcessWriteQueueAsync`, `WithConnection` returns `false` when the logger has been disposed mid-drain. The code does not inspect the returned `written` flag and unconditionally calls `pending.Completion.TrySetResult()`, again reporting success for an event the comment admits "simply cannot be persisted." The event log is the site's diagnostic audit trail. A caller that `await`s `LogEventAsync` to confirm a critical event (deployment applied, alarm activated) was recorded will observe a *successful* completion for an event that was silently dropped. This is the same class of defect SiteEventLogging-008 fixed for write *errors* — but the disposed-drop path was left reporting false success. The window is the disposal/shutdown interval, during which shutdown-related events (graceful singleton handover, instance disable) are exactly the events most likely to be enqueued and lost. **Recommendation** For both paths, fault the `Task` (or complete it with a sentinel failure) instead of `TrySetResult()` — e.g. `pending.Completion.TrySetException(new ObjectDisposedException(...))` — so an `await`-ing caller can distinguish a dropped event from a persisted one. Inspect the `written` flag returned by `WithConnection` in `ProcessWriteQueueAsync` and only call `TrySetResult()` when `written` is `true`. Update the XML doc if a deliberate "drop silently on shutdown" semantics is chosen instead. **Resolution** _Unresolved._ ### SiteEventLogging-013 — Keyword search does not escape SQL `LIKE` wildcards in user input | | | |--|--| | Severity | Low | | Category | Correctness & logic bugs | | Status | Open | | Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:79-83` | **Description** The keyword-search filter builds the `LIKE` pattern as `$"%{request.KeywordFilter}%"` and binds it as a parameter. Parameterisation correctly prevents SQL injection, but it does **not** neutralise the `LIKE` metacharacters `%` and `_` inside the user-supplied keyword. A search for a literal `_` (common in event sources and identifiers such as `store_and_forward`, `PLC_1`, or instance IDs) is interpreted as "match any single character", and a `%` matches any run of characters. The design calls keyword search "free-text search on message and source fields ... Useful for finding events by script name, alarm name, or error message" — users will reasonably expect a literal substring match, so a query for `store_and_forward` silently returns events containing `storeXandYforward` and similar false positives. There is no way for the caller to search for a literal underscore or percent. **Recommendation** Escape `%`, `_`, and the escape character itself in `request.KeywordFilter` before wrapping it in `%...%`, and append an `ESCAPE` clause to the `LIKE` expression (e.g. `... LIKE $keyword ESCAPE '\'`). Alternatively document that the keyword field accepts `LIKE` wildcard syntax, but a literal-substring match is the behaviour the design implies. **Resolution** _Unresolved._ ### SiteEventLogging-014 — Initial purge runs synchronously on the host startup thread | | | |--|--| | Severity | Low | | Category | Performance & resource management | | Status | Open | | Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:34-48` | **Description** `EventLogPurgeService.ExecuteAsync` calls `RunPurge()` (a fully synchronous method that runs `PurgeByRetention` and `PurgeByStorageCap`) *before* the first `await` (`await timer.WaitForNextTickAsync(...)`). A `BackgroundService`'s `ExecuteAsync` is invoked from `StartAsync`, and the host's startup pipeline does not proceed past a `BackgroundService` until its `ExecuteAsync` yields at the first real `await`. Because `RunPurge()` precedes any `await`, the entire initial purge — including a cap-purge that deletes rows in 1000-row batches and runs `PRAGMA incremental_vacuum` until a near-1 GB database is back under the cap — executes inline on the startup thread, blocking host startup (and therefore the `/health/ready` gate) for as long as the purge takes. On a site that has accumulated a large log this can be a multi-second stall during every node start/failover. The class doc states the service "runs on a background thread and does not block event recording" — the startup-thread block is inconsistent with that intent. **Recommendation** Yield before the initial purge so it runs on the background scheduler rather than the startup thread — e.g. `await Task.Yield();` as the first statement of `ExecuteAsync`, or move the initial `RunPurge()` to after the first `await timer.WaitForNextTickAsync` (accepting a one-interval delay), or offload it with `await Task.Run(RunPurge, stoppingToken)`. **Resolution** _Unresolved._