# Code Review — SiteEventLogging

| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.SiteEventLogging` |
| Design doc | `docs/requirements/Component-SiteEventLogging.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 9 |

## Summary

The SiteEventLogging module is small and broadly well-structured: a SQLite-backed recorder (`SiteEventLogger`), a query service with keyset pagination, a background purge service, and a thin Akka actor bridge. The query path is parameterised correctly (no SQL injection) and reasonably well tested. However, the storage-cap enforcement is functionally broken: `PRAGMA incremental_vacuum` is a no-op because `auto_vacuum = INCREMENTAL` is never set, so the cap-purge loop never sees the database shrink and over-deletes the entire table when triggered. There is also a genuine concurrency hazard: the purge service and query service share the single `SqliteConnection` owned by `SiteEventLogger` but bypass its `_writeLock`, so a purge running on the background thread can collide with a write or a query on another thread. The `LogEventAsync` API is synchronous despite its name and `Task` return, which silently blocks Akka actor threads on disk I/O. Other findings concern the cluster-singleton placement of the handler actor (which can pin to the standby node), missing indexes for common query filters, retention/cap purge not enforcing the requirement strictly, and several documentation/maintainability issues.

#### Re-review 2026-05-17 (commit `39d737e`)

Re-reviewed the module at commit `39d737e`. All eleven prior findings remain closed (SiteEventLogging-001..003, 005..011 Resolved; 004 Won't Fix) and the resolutions hold up under inspection — the background writer, lock-guarded `WithConnection`, `auto_vacuum = INCREMENTAL` plus logical-size measurement, the severity index, and the concrete-recorder DI wiring are all present and correct at this commit.
The module source is byte-identical between `39d737e` and current `HEAD`, so this review reflects the live code. Three new findings were recorded, all low-to-medium and none regressions of prior fixes. The most notable (SiteEventLogging-012) is a correctness gap left by the SiteEventLogging-005 background-writer rework: when an event cannot be persisted because the logger has been disposed, the returned `Task` is completed *successfully* rather than faulted, so an `await`-ing caller is told a dropped audit event was written. The other two are minor: unescaped SQL `LIKE` wildcards in the keyword-search filter (SiteEventLogging-013) and a claimed initial-purge block on the host startup thread (SiteEventLogging-014 — later re-triaged to Won't Fix because the premise does not hold on .NET 8+).

#### Re-review 2026-05-28 (commit `1eb6e97`)

Re-reviewed the module at commit `1eb6e97`. All fourteen prior findings remain closed and their resolutions hold up under inspection: the lock-guarded `WithConnection` overloads, the background-writer `Channel` with disposed-mid-drain fault propagation, the `auto_vacuum = INCREMENTAL` schema + logical-size measurement, the severity index, the `LIKE` keyword-search escaping, and the concrete-recorder DI wiring are all present and correct at this commit. Nine new findings were recorded — none are regressions of prior fixes. The most notable (SiteEventLogging-016, **High**) is a correctness defect in the query path: timestamps are stored as ISO 8601 strings generated from `DateTimeOffset.UtcNow` (so they always have a `+00:00` offset suffix), but the `From`/`To` filters are stringified verbatim via `request.From.Value.ToString("o")` without normalising to UTC, so a central client that sends a non-UTC `DateTimeOffset` gets a broken lexicographic comparison that spuriously includes or excludes events.
The next-most-notable findings are SiteEventLogging-015 (unbounded background write queue can grow without limit under sustained writer slowness — sister `SqliteAuditWriter` uses a bounded channel) and SiteEventLogging-017 (the central client's `PageSize` is used verbatim with no upper-bound clamp, defeating the design's "prevents broad queries from overwhelming the communication channel" rationale). The remaining findings are low-severity hygiene / documentation: an unused `FailedWriteCount` metric, untyped severity/event-type fields, non-invariant culture parsing, the purge service running on the standby node, the redundant `Cache=Shared` on a single-connection logger, and a non-volatile stop flag in a concurrency stress test.

## Checklist coverage

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). Re-review: dropped events report success (-012); `LIKE` wildcards unescaped in keyword search (-013). |
| 2 | Akka.NET conventions | ☑ | Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). `Ask` boundary is appropriate. |
| 3 | Concurrency & thread safety | ☑ | Shared `SqliteConnection` used by purge/query without the write lock (-003). |
| 4 | Error handling & resilience | ☑ | `LogEventAsync` swallows write failures silently into a log line only (-008); purge catches broadly. |
| 5 | Security | ☑ | Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding. |
| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). Re-review: claimed initial-purge startup-thread block (-014 — Won't Fix, premise invalid on .NET 8+). |
| 7 | Design-document adherence | ☑ | Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002). |
| 8 | Code organization & conventions | ☑ | Concrete-type downcast of `ISiteEventLogger` (-007); `internal Connection` leaks DB handle (-007). |
| 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
| 10 | Documentation & comments | ☑ | `LogEventAsync` XML doc says "asynchronously" but is synchronous (-009); stale "Phase 4+" placeholder (-011). |

_Re-review (2026-05-28, `1eb6e97`):_

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps (-016); `DateTimeOffset.Parse` without invariant culture is culture-sensitive (-021); severity/event-type accept any non-empty string with no schema enforcement (-020). |
| 2 | Akka.NET conventions | ☑ | `EventLogHandlerActor` is a simple `Receive`/`Tell` bridge with no supervision concerns of its own; no new findings. |
| 3 | Concurrency & thread safety | ☑ | Concurrent-write stress test uses a non-volatile `stop` flag (-023). The shared-connection lock pattern is correct post-SiteEventLogging-003. |
| 4 | Error handling & resilience | ☑ | `FailedWriteCount` is exposed but nothing in Health Monitoring polls it — the metric is unobserved (-018). |
| 5 | Security | ☑ | Queries are fully parameterised. `PageSize` and `KeywordFilter` from the central client are not bounded (-017) — a hostile or buggy central could request `int.MaxValue` rows or multi-MB `LIKE` patterns. |
| 6 | Performance & resource management | ☑ | Background write queue is unbounded (-015); `Cache=Shared` is redundant for a single-connection logger (-022); upper-bound on `PageSize` missing (-017). |
| 7 | Design-document adherence | ☑ | `EventLogPurgeService` is registered as a per-host `BackgroundService` and runs on the standby too, but the design says "the daily background job runs on the active node" (-019). |
| 8 | Code organization & conventions | ☑ | `FailedWriteCount` is on the concrete `SiteEventLogger`, not on `ISiteEventLogger`, so any future non-concrete consumer cannot read it (-018). |
| 9 | Testing coverage | ☑ | Non-volatile `stop` flag in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` (-023). No tests for `PageSize` bounds, `From`/`To` timezone handling, or unobserved `FailedWriteCount`. |
| 10 | Documentation & comments | ☑ | `FailedWriteCount` XML doc claims "Health Monitoring can poll" but nothing does (-018). Severity / event-type docs enumerate values that are not enforced (-020). |

## Findings

### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space

| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:100-102`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:36-55` |

**Description**

`PurgeByStorageCap` issues `PRAGMA incremental_vacuum` after each delete batch to reclaim space, then re-measures the database size via `page_count * page_size`. `incremental_vacuum` only has any effect when the database was created with `auto_vacuum = INCREMENTAL`. `InitializeSchema` never sets `auto_vacuum`, so the database uses the SQLite default (`auto_vacuum = NONE`). With `NONE`, `incremental_vacuum` is silently ignored and `page_count` does not decrease when rows are deleted (free pages are retained in the file). Consequently the `while (currentSizeBytes > capBytes)` loop never observes the size dropping. The storage-cap feature required by the design ("configurable maximum database size...
oldest events are purged first") is therefore non-functional — it cannot bring the file back under the cap.

**Recommendation**

Set `PRAGMA auto_vacuum = INCREMENTAL` in `InitializeSchema` before any tables are created (it must be set before table creation or followed by a full `VACUUM` to take effect on an existing database). Alternatively, run a full `VACUUM` after cap-purge deletes, or measure logical data size (e.g. `page_count - freelist_count` times `page_size`) instead of relying on `incremental_vacuum`.

**Resolution**

Resolved 2026-05-16 (commit ``): `InitializeSchema` now sets `PRAGMA auto_vacuum = INCREMENTAL` before any table is created, and `GetDatabaseSizeBytes` measures logical size as `(page_count - freelist_count) * page_size` so reclaimed pages no longer mask the size drop. The cap-purge loop now reliably observes the database shrinking. Regression test `PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable`.

### SiteEventLogging-002 — Storage-cap purge deletes the entire table when space is not reclaimed

| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:87-105` |

**Description**

Because of SiteEventLogging-001 the on-disk size never shrinks after a delete batch, so `currentSizeBytes` stays above `capBytes`. The loop then keeps deleting 1000-row batches on every iteration until `ExecuteNonQuery` returns 0 — i.e. until the table is completely empty. The design states the cap should purge "the oldest events... first" to stay within budget, not wipe the whole log. When the cap is hit (e.g. during an alarm storm) this destroys all retained diagnostic history rather than trimming it to the budget. The unit test `PurgeByStorageCap_DeletesOldestWhenOverCap` masks the problem because it uses `MaxStorageMb = 0`, which legitimately expects an empty table, so the over-delete behaviour is never exercised against a realistic cap.

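
The free-page behaviour behind SiteEventLogging-001 and -002 can be reproduced outside the module. The following is a minimal sketch using Python's standard `sqlite3` module (chosen only for brevity; the module under review is C#). The two-column `site_events` schema, the row counts, and the helper names `size_bytes` and `delete_oldest_and_measure` are illustrative assumptions, not the module's code. It shows that with `auto_vacuum = NONE` a bulk delete leaves `page_count * page_size` unchanged, while the logical size `(page_count - freelist_count) * page_size` drops.

```python
import os
import sqlite3
import tempfile
from typing import Tuple


def size_bytes(conn: sqlite3.Connection) -> Tuple[int, int]:
    """Return (physical, logical) database size in bytes."""
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
    return page_count * page_size, (page_count - free_pages) * page_size


def delete_oldest_and_measure(rows: int = 2000, delete: int = 1000):
    """Fill a throwaway database, delete the oldest rows, report sizes.

    Returns ((physical_before, logical_before), (physical_after, logical_after)).
    """
    with tempfile.TemporaryDirectory() as tmp:
        # isolation_level=None gives autocommit mode, so the explicit
        # BEGIN/COMMIT below is the only multi-statement transaction.
        conn = sqlite3.connect(os.path.join(tmp, "events.db"), isolation_level=None)
        try:
            # Hypothetical schema; auto_vacuum must be chosen before any table exists.
            conn.execute("PRAGMA auto_vacuum = NONE")
            conn.execute(
                "CREATE TABLE site_events ("
                "id INTEGER PRIMARY KEY, message TEXT NOT NULL)"
            )
            conn.execute("BEGIN")
            conn.executemany(
                "INSERT INTO site_events (message) VALUES (?)",
                (("x" * 500,) for _ in range(rows)),
            )
            conn.execute("COMMIT")

            before = size_bytes(conn)
            # Oldest-first delete: the lowest ids are the oldest rows.
            conn.execute("DELETE FROM site_events WHERE id <= ?", (delete,))
            after = size_bytes(conn)
            return before, after
        finally:
            conn.close()
```

A loop that compares only the physical size against the cap would see no progress after this delete, which is the condition that led the original purge loop to keep deleting.
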

**Recommendation**

Fix the size measurement / vacuum (SiteEventLogging-001) so the loop terminates when the file is genuinely under the cap. Add a guard so the loop stops once `currentSizeBytes` has stopped decreasing across iterations, and add a test with a non-zero cap and a known oversized dataset to assert that only the oldest events are removed.

**Resolution**

Resolved 2026-05-16 (commit ``): with the size measurement fixed (SiteEventLogging-001) the cap loop terminates when the file is genuinely under the cap. An additional guard stops the loop if the on-disk size fails to decrease across an iteration, so a cap that can never be met no longer empties the whole table. Regression tests `PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable` (asserts the table is not emptied and the file ends under a realistic non-zero cap) and `PurgeByStorageCap_RemovesOldestEventsFirst` (asserts only the oldest events are removed).

### SiteEventLogging-003 — Shared `SqliteConnection` used by purge and query without the write lock

| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:64,90,100,110,114`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:36`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34,72` |

**Description**

`SiteEventLogger` owns a single `SqliteConnection` and serialises its own writes via `lock (_writeLock)`. `EventLogPurgeService` and `EventLogQueryService` both reach into `_eventLogger.Connection` and execute commands directly, without acquiring `_writeLock`. The purge runs on a `BackgroundService` thread (a different thread from event-recording callers and from the actor that drives the query service).
A single `SqliteConnection` / `SqliteCommand` is not thread-safe; concurrent use from the purge thread and a recording thread (or query thread) can throw `SqliteException`/`InvalidOperationException` ("DataReader already open", "connection busy") or corrupt command state. The purge `DELETE` and the recorder `INSERT` racing is the most likely collision because event recording is continuous.

**Recommendation**

Funnel all access to the connection through a single synchronisation point: either expose lock-guarded methods on `SiteEventLogger` for purge/query to call, or give the purge and query services their own dedicated `SqliteConnection` instances (SQLite supports multiple connections to the same file; `Cache=Shared` plus a `busy_timeout` makes this safer). Do not share one `SqliteConnection` across threads.

**Resolution**

Resolved 2026-05-16 (commit ``): the raw `internal Connection` property was removed from `SiteEventLogger` and replaced with lock-guarded `WithConnection(...)` overloads that hold the existing `_writeLock` for the duration of the caller's delegate. `EventLogPurgeService`, `EventLogQueryService`, and `LogEventAsync` all now access the connection exclusively through `WithConnection`, so the purge thread, query thread, and recording threads are serialised on a single lock. `Dispose` was also brought under the lock to avoid a dispose/use race. Regression test `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` exercises purge running concurrently with multiple writer threads.

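
The lock-guarded access pattern described in this resolution can be sketched as follows. This is a Python analogue for illustration only, not the module's C# implementation: `GuardedEventStore`, `with_connection`, `insert_event`, and `purge_oldest` are hypothetical names, and the two-column `site_events` schema is an assumption. The point it illustrates is that one lock covers every use of the single connection, including closing it.

```python
import sqlite3
import threading
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class GuardedEventStore:
    """Owns one SQLite connection; every caller goes through with_connection."""

    def __init__(self, path: str = ":memory:") -> None:
        # check_same_thread=False only lifts Python's same-thread check;
        # the lock below is what actually serialises access.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()
        self._closed = False
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS site_events ("
            "id INTEGER PRIMARY KEY, message TEXT NOT NULL)"
        )
        self._conn.commit()

    def with_connection(self, action: Callable[[sqlite3.Connection], T]) -> Optional[T]:
        """Run action while holding the lock; return None if already closed."""
        with self._lock:
            if self._closed:
                return None
            try:
                result = action(self._conn)
                self._conn.commit()
                return result
            except Exception:
                self._conn.rollback()
                raise

    def close(self) -> None:
        # Closing under the same lock avoids a close/use race.
        with self._lock:
            if not self._closed:
                self._closed = True
                self._conn.close()


def insert_event(message: str) -> Callable[[sqlite3.Connection], int]:
    """Build an action that inserts one event and returns the row count."""
    return lambda conn: conn.execute(
        "INSERT INTO site_events (message) VALUES (?)", (message,)
    ).rowcount


def purge_oldest(batch: int) -> Callable[[sqlite3.Connection], int]:
    """Build an action that deletes up to `batch` oldest rows."""
    return lambda conn: conn.execute(
        "DELETE FROM site_events WHERE id IN "
        "(SELECT id FROM site_events ORDER BY id LIMIT ?)",
        (batch,),
    ).rowcount
```

Writers call `store.with_connection(insert_event("..."))` and the purge thread calls `store.with_connection(purge_oldest(1000))`; neither ever holds a reference to the raw connection.
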
### SiteEventLogging-004 — Event-log handler runs as a cluster singleton that can land on the standby node

| | |
|--|--|
| Severity | Low |
| Original severity | High (re-triaged down to Low on 2026-05-16 — see Re-triage note) |
| Category | Design-document adherence |
| Status | Won't Fix |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:313-336`, `src/ScadaLink.SiteEventLogging/EventLogHandlerActor.cs:21-25` |

**Description**

`EventLogHandlerActor` is hosted as a `ClusterSingletonManager` singleton with the stated intent that "queries always reach the active node". However, an Akka.NET cluster singleton is pinned to the *oldest* member of the role, which is not the same concept as the SCADA "active node" (the node currently running the Deployment Manager singleton / serving live traffic). The design doc is explicit: "Only the active node generates and stores events... the new active node starts logging to its own SQLite database." The event-log SQLite file is node-local and unreplicated. Nothing guarantees the event-log singleton co-locates with the active node, so a remote query can be served by the standby node and read that node's near-empty database, returning no events even though the active node has a full log. The explanatory comment in `AkkaHostedService.cs` asserts the opposite of what actually happens.

**Recommendation**

Either (a) host the query handler as a normal per-node actor and route queries to the active node explicitly (the node owning the Deployment Manager singleton), or (b) make the event-log writer follow the same singleton so the writer and the query handler are guaranteed co-located. Reconcile the design doc and the inline comment with whichever model is chosen.

**Re-triage note (2026-05-16)**

The finding's central claim — that a remote query "can be served by the standby node and read that node's near-empty database" — is incorrect for the query path.
In `AkkaHostedService.cs` the `event-log-handler` `ClusterSingletonManager` and the `deployment-manager` `ClusterSingletonManager` are created with the **same role** (`siteRole`) in the **same cluster**. Akka.NET pins every cluster singleton of a given role to the *oldest member of that role* — so all same-role singletons in a cluster co-locate on one node. The "active node" in this codebase is, by definition, the node hosting the `deployment-manager` singleton; the event-log query singleton is therefore *guaranteed* to run on that same node. A `ClusterClient` query cannot land on the standby. The inline comment in `AkkaHostedService.cs` is accurate, not "the opposite of what happens".

A real but distinct concern exists: the *writer* (`SiteEventLogger`) is registered as a plain per-node DI singleton (`AddSiteEventLogging`), so it records to a local SQLite file on **every** node, including the standby. That wastes storage on the standby but does **not** cause the query-returns-nothing symptom the finding describes, because the query singleton always reads the *active* node's (populated) database. Gating the writer to the active node would be a `ScadaLink.Host` wiring change, outside this module's scope, and is a minor optimisation rather than a correctness defect.

Re-triaged from High to Low and closed as **Won't Fix**: the High-severity correctness claim does not hold. Any residual cleanup (gate the standby-node writer; the comment needs no change) can be raised as a fresh Low finding against `ScadaLink.Host` if desired.

**Resolution**

Won't Fix — 2026-05-16 (commit ``). Re-triaged: the asserted defect (query served by standby returning an empty log) cannot occur because the event-log query singleton and the deployment-manager singleton share a role and so always co-locate on the active node. No code change made; see the re-triage note above.

### SiteEventLogging-005 — `LogEventAsync` performs synchronous disk I/O on the caller's thread

| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57-99` |

**Description**

`LogEventAsync` is declared `async`-shaped (returns `Task`, `Async` suffix) but its body is entirely synchronous: it takes `lock (_writeLock)`, runs `cmd.ExecuteNonQuery()` (a blocking SQLite write), then returns `Task.CompletedTask`. Callers across the codebase invoke it fire-and-forget as `_ = LogEventAsync(...)` (e.g. `ScriptExecutionActor.cs:133`, `DataConnectionActor.cs:292`, `ScriptActor.cs:250`) expecting it to be non-blocking. In reality the SQLite write, and any contention on `_writeLock`, executes inline on the Akka actor thread of the calling subsystem. Under an event burst (alarm storm, script failure loop) this serialises actor threads on disk I/O and the global write lock, degrading the hot-path subsystems the design intends to keep responsive.

**Recommendation**

Either make recording genuinely asynchronous (offload to a dedicated single-threaded writer / `Channel` consumer so callers truly fire-and-forget), or rename the method to `LogEvent` and document that it blocks, so callers can decide. Given the design's emphasis on not impacting runtime subsystems, an internal queue with a background flush is preferable.

**Resolution**

Resolved 2026-05-16 (commit `pending`): event recording is now offloaded to a dedicated background writer. `SiteEventLogger` owns an unbounded `Channel` and a single background consumer thread; `LogEventAsync` only validates its arguments and enqueues, so caller threads (Akka actor threads on hot paths) never block on the SQLite write or on contention for the write lock. The returned `Task` completes once the event is durably persisted (so `await` callers still observe write ordering) and faults if the write fails.
`Dispose` completes the channel and drains the writer. Regression test `LogEventAsync_DoesNotBlockCaller_WhenWriteIsSlow` (verifies the caller returns in <500 ms while the database is held busy) plus `LogEventAsync_TaskCompletes_AfterEventIsPersisted`.

### SiteEventLogging-006 — Missing indexes for severity and keyword-search query paths

| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:50-52`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:65-81` |

**Description**

`InitializeSchema` creates indexes on `timestamp`, `event_type`, and `instance_id`. The query service also filters on `severity` (`severity = $severity`) and performs `message LIKE '%...%'` / `source LIKE '%...%'` keyword search. `severity` has no index, and a leading-wildcard `LIKE` cannot use a normal index at all. With up to a 1 GB database and a 500-row page size, severity-filtered and keyword queries do full table scans on every page. The design explicitly lists keyword search as a supported, expected query type.

**Recommendation**

Add an index on `severity` (or a composite index aligned with common filter combinations such as `(event_type, severity, id)`). For keyword search, consider an FTS5 virtual table over `message` and `source`, or accept the scan but document the cost.

**Resolution**

Resolved 2026-05-16 (commit `pending`): `InitializeSchema` now creates `idx_events_severity` on `site_events(severity)`, so severity-filtered queries use an index instead of full-scanning. The leading-wildcard `LIKE` keyword search cannot use a B-tree index by construction; rather than adding an FTS5 table, that scan is accepted and now documented with an inline comment next to the index DDL. Regression tests `Schema_HasIndexOnSeverity` and `SeverityFilteredQuery_UsesIndex_NotFullScan` (the latter asserts the SQLite query plan uses `idx_events_severity` and does not `SCAN site_events`).

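
The query-plan check that this resolution describes can be approximated with `EXPLAIN QUERY PLAN`. The sketch below uses Python's standard `sqlite3` module rather than the module's C# test code; the column list is a reduced, assumed version of the real schema, and `build_schema`, `query_plan`, `SEVERITY_QUERY`, and `KEYWORD_QUERY` are hypothetical names. Only the table name `site_events` and the index name `idx_events_severity` come from the finding.

```python
import sqlite3
from typing import List

# Simplified stand-ins for the two query shapes discussed in the finding.
SEVERITY_QUERY = "SELECT id FROM site_events WHERE severity = ?"
KEYWORD_QUERY = "SELECT id FROM site_events WHERE message LIKE ?"


def build_schema() -> sqlite3.Connection:
    """Create an in-memory database with an assumed, reduced schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE site_events (
            id        INTEGER PRIMARY KEY,
            timestamp TEXT NOT NULL,
            severity  TEXT NOT NULL,
            source    TEXT NOT NULL,
            message   TEXT NOT NULL
        );
        CREATE INDEX idx_events_severity ON site_events(severity);
        """
    )
    return conn


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> List[str]:
    """Return the human-readable plan steps for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The descriptive text is the last column of each plan row.
    return [row[-1] for row in rows]
```

Calling `query_plan(conn, SEVERITY_QUERY, ("Error",))` should report a step that names `idx_events_severity`, while `query_plan(conn, KEYWORD_QUERY, ("%pump%",))` should report a scan of the table with no index. The exact wording of plan steps varies between SQLite versions, so a test is safer matching on the index name than on the full string.
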
### SiteEventLogging-007 — `ISiteEventLogger` consumers downcast to the concrete type

| | |
|--|--|
| Severity | Medium (partially re-triaged 2026-05-16 — see Re-triage note) |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:21-30`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:20-28`, `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:10-23` |

**Description**

Both `EventLogPurgeService` and `EventLogQueryService` took `ISiteEventLogger` via DI and immediately downcast it: `_eventLogger = (SiteEventLogger)eventLogger;`. This made the registration fragile — any `ISiteEventLogger` that is not exactly `SiteEventLogger` (a test double, a decorator) caused an `InvalidCastException` at construction — and defeated the purpose of the interface abstraction.

**Re-triage note (2026-05-16)**

The finding as originally written also claimed the services "access the `internal SqliteConnection Connection` property to run arbitrary SQL" and called itself "the root cause of the unsynchronised connection sharing in SiteEventLogging-003". That part is stale: the resolution of SiteEventLogging-003 had already removed the `internal Connection` property and replaced it with lock-guarded `WithConnection` overloads. At the time this finding was actioned, the only remaining defect was the concrete-type downcast itself. Severity stays Medium; the description is corrected to the downcast-only scope.

**Recommendation**

Have the purge and query services depend on the concrete `SiteEventLogger` directly (it is the type that owns the lock-guarded `WithConnection`), and register the concrete type in DI with the interface forwarded to the same singleton. Remove the fragile downcasts.

**Resolution**

Resolved 2026-05-16 (commit `pending`): `AddSiteEventLogging` now registers `SiteEventLogger` as a concrete singleton and forwards `ISiteEventLogger` to that same instance.
`EventLogPurgeService` and `EventLogQueryService` take a `SiteEventLogger` constructor parameter directly, eliminating the `(SiteEventLogger)eventLogger` downcast and its `InvalidCastException` risk. All three services still share one connection/lock. Regression tests `AddSiteEventLogging_ResolvesAllServices_SharingOneRecorderInstance` and `PurgeAndQueryServices_AcceptConcreteRecorder_WithoutDowncast`.

### SiteEventLogging-008 — Event-recording write failures are silently swallowed

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:92-95` |

**Description**

If `ExecuteNonQuery` throws (disk full, database locked, file corruption), the exception is caught, written to `ILogger`, and discarded; `LogEventAsync` still returns `Task.CompletedTask` as if successful. Callers fire-and-forget the result so they cannot detect failure. The event log is the site's diagnostic audit trail; a sustained write failure (for example a locked-database storm caused by the unsynchronised purge in SiteEventLogging-003) means events vanish with no signal to operators except a local log line that nobody is watching. There is no failure counter, no health-metric hook, and no retry.

**Recommendation**

Expose a failure signal: increment a counter that the Health Monitoring component can surface (the design notes script/alarm error rates are derived from the event log — a logging outage should be visible). At minimum, escalate repeated failures to a Warning/Error health metric rather than only a local log line.

**Resolution**

Resolved 2026-05-16 (commit `pending`): write failures are no longer swallowed.
The background writer (introduced for SiteEventLogging-005) now, on an `INSERT` failure, (a) increments a new `Interlocked`-guarded counter exposed as the public `SiteEventLogger.FailedWriteCount` property — which Health Monitoring can poll to detect a logging outage — and (b) faults the `Task` returned by `LogEventAsync` with the exception instead of returning `Task.CompletedTask`. The error is still logged. Regression test `LogEventAsync_FaultsTask_AndCountsFailure_OnWriteError`.

### SiteEventLogging-009 — XML doc on `LogEventAsync` claims asynchronous behaviour

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:8-10`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57` |

**Description**

The interface XML doc states "Record an event asynchronously." and the method is named `LogEventAsync`, but the implementation is fully synchronous (see SiteEventLogging-005). The documentation and naming are misleading: a reader will reasonably assume the write is offloaded and the caller's thread is not blocked, which is false. The `details` parameter doc says "Optional JSON details" but nothing validates or requires JSON, so callers may pass arbitrary text.

**Re-triage note (2026-05-16)**

The finding's central premise is now stale. SiteEventLogging-005 was resolved by introducing a dedicated background writer: `LogEventAsync` enqueues onto an unbounded `Channel` and returns without blocking, so the method is now *genuinely* asynchronous and the name `LogEventAsync` is accurate. No rename is warranted. The only residual documentation defect was the imprecise XML doc (the one-line summary did not describe the enqueue/await semantics) and the `details` doc claiming "JSON" when nothing enforces it. Scope corrected to those doc fixes only.

**Recommendation**

Align the name, signature, and documentation with the actual behaviour — either make the method genuinely asynchronous or rename to `LogEvent` and correct the doc. Clarify that `details` is free-form text unless JSON is actually enforced.

**Resolution**

Resolved 2026-05-16 (commit `pending`): the `ISiteEventLogger.LogEventAsync` XML doc now describes the actual behaviour — the call enqueues onto a background writer and returns without blocking, and the returned `Task` completes on durable persistence (or faults on write failure). The `details` parameter doc was corrected to "Optional free-form detail text ... JSON is conventional but not validated or enforced". No rename was needed: post-SiteEventLogging-005 the method is genuinely asynchronous, so `LogEventAsync` is accurate (see Re-triage note). Documentation-only change; no regression test added (behaviour is already covered by the SiteEventLogging-005 tests `LogEventAsync_DoesNotBlockCaller_WhenWriteIsSlow` and `LogEventAsync_TaskCompletes_AfterEventIsPersisted`).

### SiteEventLogging-010 — Test coverage gaps: actor bridge, purge/write concurrency, vacuum effectiveness, query error path

| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ScadaLink.SiteEventLogging.Tests/` |

**Description**

The test suite covers recording, query filtering/pagination, and basic purge, but several critical behaviours are untested:

- `EventLogHandlerActor` has no test — the actor message contract (`EventLogQueryRequest` -> `EventLogQueryResponse`, `Sender.Tell`) is unverified.
- No test exercises purge running concurrently with active writes/queries, so the connection-sharing race (SiteEventLogging-003) is invisible to CI.
- `PurgeByStorageCap` is only tested with `MaxStorageMb = 0`, which hides the no-op-vacuum / over-delete bug (SiteEventLogging-001, -002). No test asserts the file shrinks or that only oldest events are removed under a realistic cap.
- `EventLogQueryService.ExecuteQuery`'s catch block (`Success: false`, `ErrorMessage`) has no test.
- `SiteEventLogger.Dispose` semantics (logging after dispose returns `Task.CompletedTask`) and re-entrant dispose are untested.

**Recommendation**

Add tests for the actor bridge, a concurrency stress test (purge + write + query in parallel), a realistic non-zero-cap purge test asserting size reduction and oldest-first deletion, and a query-error-path test (e.g. corrupt/closed connection).

**Resolution**

Resolved 2026-05-16 (commit `pending`): the remaining coverage gaps are now filled. `EventLogHandlerActorTests` (using `Akka.TestKit.Xunit2`) exercises the actor message contract — `EventLogQueryRequest` -> `EventLogQueryResponse` via `Sender.Tell`, for both success and error responses. `EventLogCoverageTests` covers the previously untested `EventLogQueryService.ExecuteQuery` catch block (`ExecuteQuery_ReturnsFailureResponse_WhenDatabaseUnavailable`) and the recorder's disposed-state semantics (`LogEventAsync_AfterDispose_CompletesWithoutThrowing`, `Dispose_IsIdempotent`). The purge/write concurrency, realistic-cap, and oldest-first behaviours were already covered by the tests added when SiteEventLogging-001/-002/-003 were resolved (`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection`, `PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable`, `PurgeByStorageCap_RemovesOldestEventsFirst`).

### SiteEventLogging-011 — Stale "Phase 4+" placeholder in `ServiceCollectionExtensions`

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:18-22` |

**Description**

`AddSiteEventLoggingActors` is an empty method with a comment "Placeholder for Akka actor registration (Phase 4+)". The actor (`EventLogHandlerActor`) is in fact already implemented and is registered directly in `AkkaHostedService.cs:313-336`, not through this method.
The placeholder is dead code: it is either never called or called with no effect, and the comment is stale. A reader looking for where the event-log actor is wired up will be misdirected. **Recommendation** Either implement the actor registration here and have `AkkaHostedService` call it (centralising the wiring), or delete `AddSiteEventLoggingActors` entirely and remove the misleading comment. **Resolution** Resolved 2026-05-16 (commit `pending`): confirmed dead code — a repo-wide search found zero callers of `AddSiteEventLoggingActors`, and `EventLogHandlerActor` is in fact wired up directly in `ScadaLink.Host/Actors/AkkaHostedService.cs` as a cluster singleton (it must be created inside the `ActorSystem` with a resolved `IEventLogQueryService`, which a `IServiceCollection` extension cannot do). The empty placeholder method and its stale "Phase 4+" comment were deleted, and a short explanatory note added to `AddSiteEventLogging` pointing readers to where the actor is actually registered. Documentation/dead-code change only; no regression test was added — the change is a method removal verified by the compiler (no callers) and the full module suite still passing. ### SiteEventLogging-012 — Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted | | | |--|--| | Severity | Medium | | Category | Correctness & logic bugs | | Status | Resolved | | Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:160-166,193-197` | **Description** `LogEventAsync` returns a `Task` that, per the interface XML doc (corrected under SiteEventLogging-009), "completes once the event is durably persisted and faults if the write fails, so callers that `await` it observe success or failure." Two paths break that contract by signalling **success** for an event that was never written: 1. 
In `LogEventAsync`, if `_writeQueue.Writer.TryWrite(pending)` fails (the channel has been completed because the logger was disposed), the code calls `pending.Completion.TrySetResult()` — completing the `Task` successfully — even though the comment immediately above acknowledges "there is nowhere to persist the event." 2. In `ProcessWriteQueueAsync`, `WithConnection` returns `false` when the logger has been disposed mid-drain. The code does not inspect the returned `written` flag and unconditionally calls `pending.Completion.TrySetResult()`, again reporting success for an event the comment admits "simply cannot be persisted." The event log is the site's diagnostic audit trail. A caller that `await`s `LogEventAsync` to confirm a critical event (deployment applied, alarm activated) was recorded will observe a *successful* completion for an event that was silently dropped. This is the same class of defect SiteEventLogging-008 fixed for write *errors* — but the disposed-drop path was left reporting false success. The window is the disposal/shutdown interval, during which shutdown-related events (graceful singleton handover, instance disable) are exactly the events most likely to be enqueued and lost. **Recommendation** For both paths, fault the `Task` (or complete it with a sentinel failure) instead of `TrySetResult()` — e.g. `pending.Completion.TrySetException(new ObjectDisposedException(...))` — so an `await`-ing caller can distinguish a dropped event from a persisted one. Inspect the `written` flag returned by `WithConnection` in `ProcessWriteQueueAsync` and only call `TrySetResult()` when `written` is `true`. Update the XML doc if a deliberate "drop silently on shutdown" semantics is chosen instead. 
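
A minimal sketch of the recommended behaviour, assuming a non-generic `TaskCompletionSource` per queued event (as the `TrySetResult()` calls quoted above suggest). `PendingEventSketch`, `DisposedDropSketch`, and the exception messages are hypothetical names for illustration, not the module's actual types.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical stand-in for the recorder's queued item (not the module's real type).
public sealed record PendingEventSketch(string Message, TaskCompletionSource Completion);

public static class DisposedDropSketch
{
    // Enqueue path: if the channel has been completed (logger disposed),
    // fault the Task instead of reporting a successful write.
    public static Task Enqueue(Channel<PendingEventSketch> queue, string message)
    {
        var pending = new PendingEventSketch(
            message,
            new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously));

        if (!queue.Writer.TryWrite(pending))
        {
            pending.Completion.TrySetException(
                new ObjectDisposedException("SiteEventLogger", "Event dropped: logger disposed."));
        }

        return pending.Completion.Task;
    }

    // Drain path: only report success when the row was actually persisted.
    public static void Complete(PendingEventSketch pending, bool written)
    {
        if (written)
        {
            pending.Completion.TrySetResult();
        }
        else
        {
            pending.Completion.TrySetException(
                new ObjectDisposedException("SiteEventLogger", "Event dropped: logger disposed mid-drain."));
        }
    }
}
```

With this shape, a caller that awaits the returned `Task` sees an `ObjectDisposedException` for a dropped event and a normal completion only for a persisted one.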
**Resolution** Resolved 2026-05-17 (commit ``): both disposed-drop paths now fault the returned `Task` with `ObjectDisposedException` instead of `TrySetResult()` — `LogEventAsync` on a failed `TryWrite`, and `ProcessWriteQueueAsync` now inspects the `written` flag from `WithConnection` and only calls `TrySetResult()` when the row was actually persisted. An `await`-ing caller can now distinguish a dropped audit event from a persisted one. Regression tests `LogEventAsync_AfterDispose_FaultsTask_NotReportsSuccess` and `LogEventAsync_EnqueuedThenDisposed_FaultsTask_WhenWriteCannotComplete` (the prior `LogEventAsync_AfterDispose_CompletesWithoutThrowing` test asserted the now-incorrect silent-success behaviour and was replaced). ### SiteEventLogging-013 — Keyword search does not escape SQL `LIKE` wildcards in user input | | | |--|--| | Severity | Low | | Category | Correctness & logic bugs | | Status | Resolved | | Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:79-83` | **Description** The keyword-search filter builds the `LIKE` pattern as `$"%{request.KeywordFilter}%"` and binds it as a parameter. Parameterisation correctly prevents SQL injection, but it does **not** neutralise the `LIKE` metacharacters `%` and `_` inside the user-supplied keyword. A search for a literal `_` (common in event sources and identifiers such as `store_and_forward`, `PLC_1`, or instance IDs) is interpreted as "match any single character", and a `%` matches any run of characters. The design calls keyword search "free-text search on message and source fields ... Useful for finding events by script name, alarm name, or error message" — users will reasonably expect a literal substring match, so a query for `store_and_forward` silently returns events containing `storeXandYforward` and similar false positives. There is no way for the caller to search for a literal underscore or percent. 
**Recommendation** Escape `%`, `_`, and the escape character itself in `request.KeywordFilter` before wrapping it in `%...%`, and append an `ESCAPE` clause to the `LIKE` expression (e.g. `... LIKE $keyword ESCAPE '\'`). Alternatively document that the keyword field accepts `LIKE` wildcard syntax, but a literal-substring match is the behaviour the design implies. **Resolution** Resolved 2026-05-17 (commit ``): `EventLogQueryService` now escapes `\`, `%`, and `_` in `request.KeywordFilter` via a new `EscapeLikePattern` helper before wrapping it in `%...%`, and the `LIKE` clauses carry an explicit `ESCAPE '\'` so the keyword is matched as a literal substring as the design intends. Regression tests `Query_KeywordSearch_TreatsUnderscoreAsLiteral_NotWildcard`, `Query_KeywordSearch_TreatsPercentAsLiteral_NotWildcard`, and `Query_KeywordSearch_StillMatchesPlainSubstring`. ### SiteEventLogging-014 — Initial purge runs synchronously on the host startup thread | | | |--|--| | Severity | Low | | Category | Performance & resource management | | Status | Won't Fix | | Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:34-48` | **Description** `EventLogPurgeService.ExecuteAsync` calls `RunPurge()` (a fully synchronous method that runs `PurgeByRetention` and `PurgeByStorageCap`) *before* the first `await` (`await timer.WaitForNextTickAsync(...)`). A `BackgroundService`'s `ExecuteAsync` is invoked from `StartAsync`, and the host's startup pipeline does not proceed past a `BackgroundService` until its `ExecuteAsync` yields at the first real `await`. Because `RunPurge()` precedes any `await`, the entire initial purge — including a cap-purge that deletes rows in 1000-row batches and runs `PRAGMA incremental_vacuum` until a near-1 GB database is back under the cap — executes inline on the startup thread, blocking host startup (and therefore the `/health/ready` gate) for as long as the purge takes. 
On a site that has accumulated a large log this can be a multi-second stall during every node start/failover. The class doc states the service "runs on a background thread and does not block event recording" — the startup-thread block is inconsistent with that intent. **Recommendation** Yield before the initial purge so it runs on the background scheduler rather than the startup thread — e.g. `await Task.Yield();` as the first statement of `ExecuteAsync`, or move the initial `RunPurge()` to after the first `await timer.WaitForNextTickAsync` (accepting a one-interval delay), or offload it with `await Task.Run(RunPurge, stoppingToken)`. **Re-triage note (2026-05-17)** The finding's central premise — "the host's startup pipeline does not proceed past a `BackgroundService` until its `ExecuteAsync` yields at the first real `await`", so the synchronous `RunPurge()` prelude blocks the startup thread — is **incorrect for the .NET version this project targets**. The module targets `net10.0`. Since .NET 8, `BackgroundService.StartAsync` no longer runs `ExecuteAsync` inline on the calling (host startup) thread: `ExecuteAsync` is dispatched onto the thread pool, and `StartAsync` returns immediately regardless of how long the synchronous prelude runs. This was verified empirically. A standalone `BackgroundService` whose `ExecuteAsync` prelude does a 1.5 s synchronous `Thread.Sleep` before its first `await` showed `StartAsync` returning in **0 ms**, with the prelude running on a *different* thread pool thread than the caller. Applied to `EventLogPurgeService`, `StartAsync` returns promptly and the initial `RunPurge()` executes on the background scheduler — host startup and the `/health/ready` gate are not blocked. There is therefore no defect to fix. **Resolution** Won't Fix — 2026-05-17 (commit ``). 
Re-triaged: the asserted startup-thread block cannot occur on .NET 8+ (this module targets `net10.0`), where `BackgroundService` dispatches `ExecuteAsync` to the thread pool rather than running its synchronous prelude on the host startup thread — verified empirically (see the re-triage note). No code change made. A verification test `StartAsync_DoesNotBlock_OnTheInitialPurge` was added to pin this behaviour (asserts `StartAsync` returns in under 1 s and the initial purge still runs on the background scheduler). ### SiteEventLogging-015 — Background write queue is unbounded; can grow without limit under sustained writer slowness | | | |--|--| | Severity | Medium | | Category | Performance & resource management | | Status | Open | | Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:58-63` | **Description** `SiteEventLogger` creates its background-writer feeder as `Channel.CreateUnbounded(...)`. The writer thread funnels every write through the shared `_writeLock` (acquired by `WithConnection`), so any condition that makes a single iteration slow — a long-running query in `EventLogQueryService` holding the lock, a `PurgeByStorageCap` run that takes the lock for batched `DELETE` + `PRAGMA incremental_vacuum`, a disk stall, or a sustained event burst from an alarm storm / script failure loop — drives the queue arbitrarily large. Every queued `PendingEvent` retains its `TaskCompletionSource` and its payload strings, so there is no upper bound on how much memory the recorder can hold. The sister centralized-audit component `ScadaLink.AuditLog/Site/SqliteAuditWriter.cs` addresses the same hot-path-writer problem with `Channel.CreateBounded<...>(new BoundedChannelOptions(_options.ChannelCapacity) { ..., FullMode = BoundedChannelFullMode.Wait })`, giving back-pressure to producers. 
Site event logging picked the riskier choice for a component that, per the design, is fed by every site subsystem (script, alarm, deployment, DCL, store-and-forward, instance lifecycle, notification) and has both a 30-day retention sweep and a 1 GB cap-purge competing for the same lock.

**Recommendation**

Switch to `Channel.CreateBounded(...)` with a configurable capacity (default in the order of 10 000: large enough to absorb a normal alarm burst, small enough to bound memory). Pick a `FullMode` that matches policy: `Wait` for back-pressure (callers `await` and serialise their actor thread on the queue, which defeats some of the SiteEventLogging-005 win but is safe), or `DropOldest` / `DropWrite` with a counter (drop-and-count is closer to "best-effort audit"). Add the dropped-event counter to `FailedWriteCount` or a sibling metric. Document the chosen policy on `ISiteEventLogger.LogEventAsync`.

### SiteEventLogging-016 — `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps

| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:67-77`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:159`, `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:72-78` |

**Description**

Event rows are persisted with `timestamp` = `DateTimeOffset.UtcNow.ToString("o")`, which always emits the round-trip ISO 8601 form ending in the literal offset `+00:00` (e.g. `2026-05-28T12:34:56.7890123+00:00`). The query path filters by range using a direct string comparison:

```
whereClauses.Add("timestamp >= $from");
parameters.Add(new SqliteParameter("$from", request.From.Value.ToString("o")));
```

`request.From` is a `DateTimeOffset?` and `ToString("o")` preserves whatever offset the caller passed in.
If a central client passes a non-UTC `DateTimeOffset` (for example the result of `DateTimeOffset.Now` in a `UTC+05:00` timezone), the produced string is `"2026-05-28T17:34:56.0000000+05:00"`, which is lexicographically *greater* than the equivalent UTC instant string `"2026-05-28T12:34:56.0000000+00:00"`. The comparison `timestamp >= $from` is then evaluated as a byte-by-byte string compare (SQLite default `BINARY` collation), so the query either spuriously excludes events that genuinely occurred in the range, or spuriously includes events from a wholly different hour. The same defect applies to `To`. The retention purge does `DateTimeOffset.UtcNow.AddDays(-N).ToString("o")` (UTC) so it is safe; only the central query path is vulnerable. The design explicitly states "All timestamps are UTC throughout the system" but the boundary between a central `DateTimeOffset` and the SQLite store is not enforced. A central UI rendered in a non-UTC timezone is the most likely trigger, and the defect silently corrupts every query that filters by time range, which is exactly the filter most likely to be set on a "show me what happened around the failover" query.

**Recommendation**

Normalise `From` / `To` to UTC before serialising: `request.From.Value.ToUniversalTime().ToString("o")`, so the produced offset is always `+00:00`. Do not substitute `.UtcDateTime.ToString("o")`: a `DateTime` of kind `Utc` formats with a trailing `Z` rather than `+00:00`, so the bound would no longer match the stored suffix and boundary comparisons would still be wrong. Add a regression test that filters with a `DateTimeOffset` carrying a non-zero offset and asserts the matching events are returned. Optionally also store timestamps as Unix-epoch `INTEGER` and let SQLite compare numerically, eliminating the lexicographic-comparison hazard structurally.
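
The normalisation can be sketched as below, assuming rows are stored via `DateTimeOffset.UtcNow.ToString("o")` as described in this finding. `RangeBoundSketch` and `ToStoredForm` are hypothetical names introduced for illustration only.

```csharp
using System;
using System.Globalization;

public static class RangeBoundSketch
{
    // Serialise a range bound so that an ordinal string comparison against
    // stored "+00:00" timestamps orders the same way as the instants do.
    public static string ToStoredForm(DateTimeOffset bound) =>
        bound.ToUniversalTime().ToString("o", CultureInfo.InvariantCulture);

    // True when the verbatim (non-normalised) string sorts after the stored
    // string even though both denote the same instant.
    public static bool VerbatimSortsAfterSameInstant(DateTimeOffset bound, string storedSameInstant) =>
        string.CompareOrdinal(bound.ToString("o", CultureInfo.InvariantCulture), storedSameInstant) > 0;
}
```

For the `UTC+05:00` example in this finding, `ToStoredForm` yields the `12:34:56` `+00:00` string, whereas the verbatim form starts with `17:34:56` and therefore sorts after it. Formatting the bound through `DateTime` (for example via `UtcDateTime`) would end in `Z` instead of `+00:00`, which is why the sketch stays on `DateTimeOffset`.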
### SiteEventLogging-017 — Central client's `PageSize` is unbounded; defeats the "configurable page size" design rationale | | | |--|--| | Severity | Medium | | Category | Security | | Status | Open | | Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:55`, `src/ScadaLink.Commons/Messages/RemoteQuery/EventLogQueryRequest.cs:18` | **Description** `EventLogQueryService.ExecuteQuery` resolves the effective page size as `var pageSize = request.PageSize > 0 ? request.PageSize : _options.QueryPageSize;` and uses it directly as the SQL `LIMIT $limit` (passing `pageSize + 1` to detect "has more"). There is no upper bound. A central client — buggy or hostile — can send `PageSize = int.MaxValue`, in which case the query attempts to materialise the entire (up to 1 GB) event log into a single `List` while holding the shared write lock. This: - Builds a worst-case ~1 GB managed allocation that, depending on Akka.NET cluster message serialisation limits, will then be serialised into an `EventLogQueryResponse` and pushed over the ClusterClient pipe. - Blocks all writes (purge, recorder hot path) for the duration of the scan because the read holds `_writeLock`. - Stalls the singleton `EventLogHandlerActor`, also blocking subsequent legitimate queries. The design explicitly justifies pagination as preventing exactly this — "Results are paginated with a configurable page size (default: 500 events) ... This prevents broad queries from overwhelming the communication channel." The code honours the *default* but does not enforce an *upper bound* on a client-supplied override. **Recommendation** Clamp `pageSize` to a configurable maximum (e.g. `SiteEventLogOptions.MaxQueryPageSize`, default 5000) before using it. Also bound `KeywordFilter.Length` (e.g. 256 chars) — a leading-wildcard `LIKE` of an unbounded pattern is itself an expensive operation that runs under the same lock. 
Add a `Success: false, ErrorMessage: "PageSize exceeds maximum"` reject path so a misbehaving central is told why its query is refused. ### SiteEventLogging-018 — `FailedWriteCount` is exposed but never consumed by Health Monitoring | | | |--|--| | Severity | Low | | Category | Documentation & comments | | Status | Open | | Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:67-71,225-226` | **Description** `SiteEventLogger.FailedWriteCount` was added under SiteEventLogging-008 with the XML doc statement "Surfaced so Health Monitoring can detect a logging outage instead of relying on a local log line nobody is watching." The implementation is correct (`Interlocked.Increment` on write failure, `Interlocked.Read` getter), but a repo-wide search shows **no** caller anywhere in `src/` reads the property — neither `ScadaLink.HealthMonitoring`, the central health collector, nor the host's `/health` endpoint. The metric is dead-letter: a logging outage still goes unnoticed in production, contradicting the original finding's resolution claim. The property is also exposed only on the concrete `SiteEventLogger`, not on `ISiteEventLogger`, so even if Health Monitoring were wired up it would have to take a concrete-type dependency (`internal Connection` removed, but `FailedWriteCount` remained concrete-only). **Recommendation** Either (a) wire `FailedWriteCount` into the existing Health Monitoring metric pipeline (e.g. publish it alongside other 30-second-interval site metrics, and promote a sustained non-zero value to a Warning), and add it to `ISiteEventLogger` so the consumer doesn't downcast; or (b) acknowledge the metric is unobserved by softening the XML doc to "Available for future Health Monitoring integration" and file a tracking item for the wiring. The current doc claim is misleading. 
### SiteEventLogging-019 — `EventLogPurgeService` runs on every host node; design says "active node" | | | |--|--| | Severity | Low | | Category | Design-document adherence | | Status | Open | | Location | `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:21`, `docs/requirements/Component-SiteEventLogging.md:45` | **Description** `AddSiteEventLogging` calls `services.AddHostedService()`, which registers the purge `BackgroundService` per host. On a 2-node site cluster both `node-a` and `node-b` start the service independently, so each runs its own 30-day retention purge and 1 GB cap purge against its own local `site_events.db`. The design states only "A daily background job runs on the active node and deletes all events older than 30 days." (Component-SiteEventLogging, Storage section). In practice the standby node receives no writes, so its purge finds nothing to delete and is harmless — but the implementation does not match the documented "active node" gating, and the resolution note on SiteEventLogging-004 already flagged that the *writer* runs on the standby too. The purge has the same shape. Aligning to the design is also a defence against a future change that does write to the standby (e.g. local heartbeats), and removes the per-node wake-ups that contribute to `Microsoft.Extensions.Hosting` shutdown latency. **Recommendation** Either (a) gate the purge service on "this node is the active member of `siteRole`" (check the cluster singleton ownership before each `RunPurge()`, or host the purge inside the same cluster singleton as `EventLogHandlerActor`), or (b) reword the design doc to "the purge runs on every node against its own local database; on the standby it is a no-op". Pick one; the current mismatch is a doc-vs-code defect. 
### SiteEventLogging-020 — `severity` and `eventType` are unvalidated free-form strings; doc enumerates a set that is not enforced

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:144-156`, `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:14-15` |

**Description**

`LogEventAsync` validates `eventType` and `severity` only for non-empty/non-whitespace. The XML doc enumerates the allowed values: `eventType` ∈ {script, alarm, deployment, connection, store_and_forward, instance_lifecycle}, `severity` ∈ {Info, Warning, Error}. Nothing in the code enforces either set. Any caller can pass `"SCRIPT"`, `"Script"`, `"warn"`, `"ERR"`, or a typo and the row is inserted verbatim. Two follow-on consequences:

1. The `EventLogQueryService.Severity` filter is `severity = $severity` (exact match, case-sensitive by SQLite default `BINARY` collation). A row stored as `"error"` will not be returned for a query filtering on `"Error"`. The design lists severity as a first-class filter and the central UI will reasonably normalise to one casing, so every row stored with a different casing is silently invisible to that filter.
2. The `Events Logged` table in the design implicitly relies on a stable `event_type` enumeration to drive UI grouping; a typo'd `event_type` slips in silently and is hard to detect later.

**Recommendation**

Validate `eventType` and `severity` against a known set (or accept `enum`s on the interface, converting to canonical string at the call site). Reject unknown values with `ArgumentException` and log a single-shot warning during construction if a deployment is found to be using an unexpected value. Alternatively, normalise casing on both the write path and the query filter (e.g. `severity.ToLowerInvariant()` applied on both sides; existing rows would need the same normalisation) so that the exact-match filter behaves case-insensitively. Normalising only the stored value does not help if the filter value keeps its original casing. Update the XML doc to match the enforced contract.
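
One way to sketch the validation option, assuming the canonical spellings are the ones the XML doc enumerates (`Info`, `Warning`, `Error`). `SeverityCanon` and `Normalise` are hypothetical names, not part of the module.

```csharp
using System;
using System.Collections.Generic;

public static class SeverityCanon
{
    // Case-insensitive lookup that maps any accepted casing to one canonical
    // spelling. OrdinalIgnoreCase keeps the match independent of the current culture.
    private static readonly Dictionary<string, string> Known =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["Info"] = "Info",
            ["Warning"] = "Warning",
            ["Error"] = "Error",
        };

    // Returns the canonical spelling, or throws for values outside the set.
    public static string Normalise(string severity)
    {
        if (!string.IsNullOrWhiteSpace(severity)
            && Known.TryGetValue(severity.Trim(), out var canonical))
        {
            return canonical;
        }

        throw new ArgumentException($"Unknown severity '{severity}'.", nameof(severity));
    }
}
```

Applying the same function to the value bound to the `severity` filter keeps the exact-match comparison consistent with what the write path stores.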
### SiteEventLogging-021 — `DateTimeOffset.Parse` uses the current culture; can throw on non-default locales

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:138` |

**Description**

`ExecuteQuery` materialises rows via `DateTimeOffset.Parse(reader.GetString(1))`. `DateTimeOffset.Parse(string)` uses `CultureInfo.CurrentCulture` and `DateTimeStyles.None`. The stored format is ISO 8601 round-trip (`"o"`), which is *usually* parseable in any culture, but a production node running with a non-default culture (e.g. Turkish "tr-TR", which has historically broken case-insensitive ASCII comparisons via the "Turkish-I" issue, or any culture that overrides the date/time separators) can parse incorrectly or throw `FormatException`. The exception is caught by the outer `try`, so the entire query is converted to a `Success: false` response, but the failure mode is silent and culture-dependent. The recorder side stores via `DateTimeOffset.UtcNow.ToString("o")`. The `"o"` specifier is defined to emit the same string regardless of culture, so the emitters are not themselves culture-sensitive; the exposure is on the parse side, where the round-trip between insert and query depends on the current-culture parser rather than on an explicit contract.

**Recommendation**

Parse with explicit invariant culture and round-trip style: `DateTimeOffset.Parse(reader.GetString(1), CultureInfo.InvariantCulture, DateTimeStyles.RoundtripKind)`. Passing `CultureInfo.InvariantCulture` to the `ToString("o")` emitters in `SiteEventLogger.LogEventAsync` and `EventLogPurgeService.PurgeByRetention` does not change their output, but makes the intent explicit and keeps both sides symmetric. Alternatively switch the schema to store `timestamp` as Unix-epoch `INTEGER` and avoid all string-parsing.
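
A small sketch of culture-pinned formatting and parsing for the stored timestamp, assuming the `"o"` round-trip form described above. `TimestampCodec` is a hypothetical helper name, not something the module defines.

```csharp
using System;
using System.Globalization;

public static class TimestampCodec
{
    // Write side: the "o" round-trip form, with the culture stated explicitly.
    public static string Format(DateTimeOffset value) =>
        value.ToString("o", CultureInfo.InvariantCulture);

    // Read side: parse with the invariant culture so the result does not
    // depend on the node's CultureInfo.CurrentCulture.
    public static DateTimeOffset Parse(string stored) =>
        DateTimeOffset.Parse(stored, CultureInfo.InvariantCulture, DateTimeStyles.RoundtripKind);
}
```

Keeping both directions in one helper makes it harder for the insert and query paths to drift apart.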
### SiteEventLogging-022 — `Cache=Shared` is redundant for a single-connection logger

| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:52` |

**Description**

The connection string is built as `$"Data Source={options.Value.DatabasePath};Cache=Shared"`. SQLite's shared-cache mode is a *cross-connection* optimisation: it lets multiple `SqliteConnection`s in the same process share an in-process page cache. This logger owns exactly one `SqliteConnection` and serialises all access through `_writeLock`, so `Cache=Shared` cannot share with anything; the mode is dormant. At best it is dead configuration; at worst it adds (very small) per-statement lock overhead inside SQLite. The sister `SqliteAuditWriter` carries the same unused option, so the smell is a copy-and-paste pattern. Shared-cache mode also subtly changes the semantics of `PRAGMA busy_timeout` and `PRAGMA locking_mode`, so leaving it on while *not* using it is a small future foot-gun if anyone later opens a second connection to the same file from another component on the same host (e.g. a tooling read-only viewer).

**Recommendation**

Drop `Cache=Shared` from the connection string: the logger is single-connection and gains nothing from it. If a future need to share the DB across connections in the same process arises, reintroduce it deliberately together with the `busy_timeout` and `locking_mode` review that should accompany it.
### SiteEventLogging-023 — Concurrent-stress test uses a non-volatile `stop` flag

| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.SiteEventLogging.Tests/EventLogPurgeServiceTests.cs:282-308` |

**Description**

`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` uses a plain `bool stop = false;` that the main test thread mutates after the purge task completes (`stop = true;`) while four background writer tasks are spin-checking `while (!stop)`. The flag is not declared `volatile`, not wrapped in `Volatile.Read/Volatile.Write`, and not behind a memory barrier. On a release build with a relaxed memory model the writer threads are permitted to cache the `stop = false` read indefinitely, which means in theory the test can hang past xUnit's per-test timeout instead of asserting `Empty(exceptions)`. The test relies on observed JIT/runtime behaviour that today happens to refresh the field across the `await _eventLogger.LogEventAsync` boundary, but that is an implementation detail rather than a contract. The test is a regression test for SiteEventLogging-003; a flaky / hang-prone version of it can mask the very behaviour it is meant to pin.

**Recommendation**

Use a `CancellationTokenSource` (`while (!cts.IsCancellationRequested)`), or change `stop` to a `volatile bool`, or use `Interlocked.Exchange` / `Volatile.Read`. `CancellationTokenSource` is the canonical .NET pattern and also lets the writer loops cooperate with a test-level timeout around the final `Task.WhenAll`.
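
The `CancellationTokenSource` variant could look roughly like this. It is a sketch only: `StressLoopSketch`, `RunAsync`, and the `writeOnce` / `foreground` delegates are hypothetical stand-ins for the test's writer body and purge call, not the existing test code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class StressLoopSketch
{
    // Starts `writers` background loops that call `writeOnce` until the
    // foreground work finishes, then cancels them and returns any exceptions
    // the loops captured.
    public static async Task<ConcurrentBag<Exception>> RunAsync(
        int writers, Func<Task> writeOnce, Func<Task> foreground)
    {
        using var cts = new CancellationTokenSource();
        var exceptions = new ConcurrentBag<Exception>();

        var tasks = Enumerable.Range(0, writers).Select(_ => Task.Run(async () =>
        {
            // IsCancellationRequested is safe to poll from other threads,
            // unlike a plain captured bool.
            while (!cts.IsCancellationRequested)
            {
                try { await writeOnce(); }
                catch (Exception ex) { exceptions.Add(ex); }
            }
        })).ToArray();

        await foreground();   // e.g. the purge under test
        cts.Cancel();         // signal every writer loop to stop
        await Task.WhenAll(tasks);
        return exceptions;
    }
}
```

A test would pass the logging call as `writeOnce`, the purge as `foreground`, and assert that the returned bag is empty.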