ScadaBridge/code-reviews/SiteEventLogging/findings.md

# Code Review — SiteEventLogging

| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.SiteEventLogging` |
| Design doc | `docs/requirements/Component-SiteEventLogging.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 5 |

## Summary

The SiteEventLogging module is small and broadly well-structured: a SQLite-backed
recorder (`SiteEventLogger`), a query service with keyset pagination, a background
purge service, and a thin Akka actor bridge. The query path is parameterised
correctly (no SQL injection) and reasonably well tested. However, the storage-cap
enforcement is functionally broken: `PRAGMA incremental_vacuum` is a no-op because
`auto_vacuum = INCREMENTAL` is never set, so the cap-purge loop never sees the
database shrink and over-deletes the entire table when triggered. There is also a
genuine concurrency hazard: the purge service and query service share the single
`SqliteConnection` owned by `SiteEventLogger` but bypass its `_writeLock`, so a purge
running on the background thread can collide with a write or a query on another
thread. The `LogEventAsync` API is synchronous despite its name and `Task` return,
which silently blocks Akka actor threads on disk I/O. Other findings concern the
cluster-singleton placement of the handler actor (which can pin to the standby
node), missing indexes for common query filters, retention/cap purge not enforcing
the requirement strictly, and several documentation/maintainability issues.

#### Re-review 2026-05-17 (commit `39d737e`)

Re-reviewed the module at commit `39d737e`. All eleven prior findings remain closed
(SiteEventLogging-001..003, 005..011 Resolved; 004 Won't Fix) and the resolutions
hold up under inspection — the background writer, lock-guarded `WithConnection`,
`auto_vacuum = INCREMENTAL` plus logical-size measurement, the severity index, and
the concrete-recorder DI wiring are all present and correct at this commit. The
module source is byte-identical between `39d737e` and current `HEAD`, so this review
reflects the live code. Three new findings were recorded, all low-to-medium and none
regressions of prior fixes. The most notable (SiteEventLogging-012) is a correctness
gap left by the SiteEventLogging-005 background-writer rework: when an event cannot
be persisted because the logger has been disposed, the returned `Task` is completed
*successfully* rather than faulted, so an `await`-ing caller is told a dropped audit
event was written. The other two are minor: unescaped SQL `LIKE` wildcards in the
keyword-search filter (SiteEventLogging-013) and a claimed initial-purge block on the
host startup thread (SiteEventLogging-014 — later re-triaged to Won't Fix, the
premise does not hold on .NET 8+).

#### Re-review 2026-05-28 (commit `1eb6e97`)

Re-reviewed the module at commit `1eb6e97`. All fourteen prior findings remain closed
and their resolutions hold up under inspection: the lock-guarded `WithConnection`
overloads, the background-writer `Channel<T>` with disposed-mid-drain fault
propagation, the `auto_vacuum = INCREMENTAL` schema + logical-size measurement, the
severity index, the `LIKE` keyword-search escaping, and the concrete-recorder DI
wiring are all present and correct at this commit. Nine new findings were recorded —
none are regressions of prior fixes. The most notable (SiteEventLogging-016, **High**)
is a correctness defect in the query path: timestamps are stored as ISO 8601 strings
generated from `DateTimeOffset.UtcNow` (so they always have a `+00:00` offset suffix),
but the `From`/`To` filters are stringified verbatim via `request.From.Value.ToString("o")`
without normalising to UTC, so a central client that sends a non-UTC `DateTimeOffset`
gets a broken lexicographic comparison and either spuriously includes or excludes
events. The next-most-notable findings are SiteEventLogging-015 (unbounded background
write queue can grow without limit under sustained writer slowness — sister
`SqliteAuditWriter` uses a bounded channel) and SiteEventLogging-017 (the central
client's `PageSize` is used verbatim with no upper-bound clamp, defeating the design's
"prevents broad queries from overwhelming the communication channel" rationale). The
remaining findings are low-severity hygiene / documentation: an unused
`FailedWriteCount` metric, untyped severity/event-type fields, non-invariant culture
parsing, the purge service running on the standby node, the redundant `Cache=Shared`
on a single-connection logger, and a non-volatile stop flag in a concurrency stress
test.

## Checklist coverage

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). Re-review: dropped events report success (-012); `LIKE` wildcards unescaped in keyword search (-013). |
| 2 | Akka.NET conventions | ☑ | Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). `Ask` boundary is appropriate. |
| 3 | Concurrency & thread safety | ☑ | Shared `SqliteConnection` used by purge/query without the write lock (-003). |
| 4 | Error handling & resilience | ☑ | `LogEventAsync` swallows write failures silently into a log line only (-008); purge catches broadly. |
| 5 | Security | ☑ | Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding. |
| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). Re-review: claimed initial-purge startup-thread block (-014 — Won't Fix, premise invalid on .NET 8+). |
| 7 | Design-document adherence | ☑ | Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002). |
| 8 | Code organization & conventions | ☑ | Concrete-type downcast of `ISiteEventLogger` (-007); `internal Connection` leaks DB handle (-007). |
| 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
| 10 | Documentation & comments | ☑ | `LogEventAsync` XML doc says "asynchronously" but is synchronous (-009); stale "Phase 4+" placeholder (-011). |

_Re-review (2026-05-28, `1eb6e97`):_

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps (-016); `DateTimeOffset.Parse` without invariant culture is culture-sensitive (-021); severity/event-type accept any non-empty string with no schema enforcement (-020). |
| 2 | Akka.NET conventions | ☑ | `EventLogHandlerActor` is a simple `Receive`/`Tell` bridge with no supervision concerns of its own; no new findings. |
| 3 | Concurrency & thread safety | ☑ | Concurrent-write stress test uses a non-volatile `stop` flag (-023). The shared-connection lock pattern is correct post-SiteEventLogging-003. |
| 4 | Error handling & resilience | ☑ | `FailedWriteCount` is exposed but nothing in Health Monitoring polls it — the metric is unobserved (-018). |
| 5 | Security | ☑ | Queries are fully parameterised. `PageSize` and `KeywordFilter` from the central client are not bounded (-017) — a hostile or buggy central could request `int.MaxValue` rows or multi-MB `LIKE` patterns. |
| 6 | Performance & resource management | ☑ | Background write queue is unbounded (-015); `Cache=Shared` is redundant for a single-connection logger (-022); upper-bound on `PageSize` missing (-017). |
| 7 | Design-document adherence | ☑ | `EventLogPurgeService` is registered as a per-host `BackgroundService` and runs on the standby too, but the design says "the daily background job runs on the active node" (-019). |
| 8 | Code organization & conventions | ☑ | `FailedWriteCount` is on the concrete `SiteEventLogger`, not on `ISiteEventLogger`, so any future non-concrete consumer cannot read it (-018). |
| 9 | Testing coverage | ☑ | Non-volatile `stop` flag in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` (-023). No tests for `PageSize` bounds, `From`/`To` timezone handling, or unobserved `FailedWriteCount`. |
| 10 | Documentation & comments | ☑ | `FailedWriteCount` XML doc claims "Health Monitoring can poll" but nothing does (-018). Severity / event-type docs enumerate values that are not enforced (-020). |

## Findings

### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space

| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:100-102`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:36-55` |

**Description**

`PurgeByStorageCap` issues `PRAGMA incremental_vacuum` after each delete batch to
reclaim space, then re-measures the database size via `page_count * page_size`.
`incremental_vacuum` only has any effect when the database was created with
`auto_vacuum = INCREMENTAL`. `InitializeSchema` never sets `auto_vacuum`, so the
database uses the SQLite default (`auto_vacuum = NONE`). With `NONE`,
`incremental_vacuum` is silently ignored and `page_count` does not decrease when
rows are deleted (free pages are retained in the file). Consequently the
`while (currentSizeBytes > capBytes)` loop never observes the size dropping. The
storage-cap feature required by the design ("configurable maximum database size...
oldest events are purged first") is therefore non-functional — it cannot bring the
file back under the cap.

**Recommendation**

Set `PRAGMA auto_vacuum = INCREMENTAL` in `InitializeSchema` before any tables are
created (it must be set before table creation or followed by a full `VACUUM` to take
effect on an existing database). Alternatively, run a full `VACUUM` after cap-purge
deletes, or measure logical data size (e.g. `page_count - freelist_count` times
`page_size`) instead of relying on `incremental_vacuum`.

**Resolution**

Resolved 2026-05-16 (commit `<pending>`): `InitializeSchema` now sets
`PRAGMA auto_vacuum = INCREMENTAL` before any table is created, and
`GetDatabaseSizeBytes` measures logical size as `(page_count - freelist_count) *
page_size` so reclaimed pages no longer mask the size drop. The cap-purge loop now
reliably observes the database shrinking. Regression test
`PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable`.

### SiteEventLogging-002 — Storage-cap purge deletes the entire table when space is not reclaimed

| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:87-105` |

**Description**

Because of SiteEventLogging-001 the on-disk size never shrinks after a delete batch,
so `currentSizeBytes` stays above `capBytes`. The loop then keeps deleting 1000-row
batches on every iteration until `ExecuteNonQuery` returns 0 — i.e. until the table
is completely empty. The design states the cap should purge "the oldest events...
first" to stay within budget, not wipe the whole log. When the cap is hit (e.g.
during an alarm storm) this destroys all retained diagnostic history rather than
trimming it to the budget. The unit test `PurgeByStorageCap_DeletesOldestWhenOverCap`
masks the problem because it uses `MaxStorageMb = 0`, which legitimately expects an
empty table, so the over-delete behaviour is never exercised against a realistic cap.

**Recommendation**

Fix the size measurement / vacuum (SiteEventLogging-001) so the loop terminates when
the file is genuinely under the cap. Add a guard so the loop stops once
`currentSizeBytes` has stopped decreasing across iterations, and add a test with a
non-zero cap and a known oversized dataset to assert that only the oldest events are
removed.

**Resolution**

Resolved 2026-05-16 (commit `<pending>`): with the size measurement fixed
(SiteEventLogging-001) the cap loop terminates when the file is genuinely under the
cap. An additional guard stops the loop if the on-disk size fails to decrease across
an iteration, so a cap that can never be met no longer empties the whole table.
Regression tests `PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable` (asserts the
table is not emptied and the file ends under a realistic non-zero cap) and
`PurgeByStorageCap_RemovesOldestEventsFirst` (asserts only the oldest events are
removed).

### SiteEventLogging-003 — Shared `SqliteConnection` used by purge and query without the write lock

| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:64,90,100,110,114`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:36`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34,72` |

**Description**

`SiteEventLogger` owns a single `SqliteConnection` and serialises its own writes via
`lock (_writeLock)`. `EventLogPurgeService` and `EventLogQueryService` both reach
into `_eventLogger.Connection` and execute commands directly, without acquiring
`_writeLock`. The purge runs on a `BackgroundService` thread (a different thread from
event-recording callers and from the actor that drives the query service). A single
`SqliteConnection` / `SqliteCommand` is not thread-safe; concurrent use from the
purge thread and a recording thread (or query thread) can throw
`SqliteException`/`InvalidOperationException` ("DataReader already open",
"connection busy") or corrupt command state. The purge `DELETE` and the recorder
`INSERT` racing is the most likely collision because event recording is continuous.

**Recommendation**

Funnel all access to the connection through a single synchronisation point: either
expose lock-guarded methods on `SiteEventLogger` for purge/query to call, or give the
purge and query services their own dedicated `SqliteConnection` instances (SQLite
supports multiple connections to the same file; `Cache=Shared` plus a `busy_timeout`
makes this safer). Do not share one `SqliteConnection` across threads.

**Resolution**

Resolved 2026-05-16 (commit `<pending>`): the raw `internal Connection` property was
removed from `SiteEventLogger` and replaced with lock-guarded `WithConnection(...)`
overloads that hold the existing `_writeLock` for the duration of the caller's
delegate. `EventLogPurgeService`, `EventLogQueryService`, and `LogEventAsync` all now
access the connection exclusively through `WithConnection`, so the purge thread,
query thread, and recording threads are serialised on a single lock. `Dispose` was
also brought under the lock to avoid a dispose/use race. Regression test
`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` exercises purge running
concurrently with multiple writer threads.

### SiteEventLogging-004 — Event-log handler runs as a cluster singleton that can land on the standby node

| | |
|--|--|
| Severity | Low |
| Original severity | High (re-triaged down to Low on 2026-05-16 — see Re-triage note) |
| Category | Design-document adherence |
| Status | Won't Fix |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:313-336`, `src/ScadaLink.SiteEventLogging/EventLogHandlerActor.cs:21-25` |

**Description**

`EventLogHandlerActor` is hosted as a `ClusterSingletonManager` singleton with the
stated intent that "queries always reach the active node". However, an Akka.NET
cluster singleton is pinned to the *oldest* member of the role, which is not the
same concept as the SCADA "active node" (the node currently running the Deployment
Manager singleton / serving live traffic). The design doc is explicit: "Only the
active node generates and stores events... the new active node starts logging to its
own SQLite database." The event-log SQLite file is node-local and unreplicated.
Nothing guarantees the event-log singleton co-locates with the active node, so a
remote query can be served by the standby node and read that node's near-empty
database, returning no events even though the active node has a full log. The
explanatory comment in `AkkaHostedService.cs` asserts the opposite of what actually
happens.

**Recommendation**

Either (a) host the query handler as a normal per-node actor and route queries to
the active node explicitly (the node owning the Deployment Manager singleton), or
(b) make the event-log writer follow the same singleton so the writer and the query
handler are guaranteed co-located. Reconcile the design doc and the inline comment
with whichever model is chosen.

**Re-triage note (2026-05-16)**

The finding's central claim — that a remote query "can be served by the standby
node and read that node's near-empty database" — is incorrect for the query path.
In `AkkaHostedService.cs` the `event-log-handler` `ClusterSingletonManager` and the
`deployment-manager` `ClusterSingletonManager` are created with the **same role**
(`siteRole`) in the **same cluster**. Akka.NET pins every cluster singleton of a
given role to the *oldest member of that role* — so all same-role singletons in a
cluster co-locate on one node. The "active node" in this codebase is, by definition,
the node hosting the `deployment-manager` singleton; the event-log query singleton
is therefore *guaranteed* to run on that same node. A `ClusterClient` query cannot
land on the standby. The inline comment in `AkkaHostedService.cs` is accurate, not
"the opposite of what happens".

A real but distinct concern exists: the *writer* (`SiteEventLogger`) is registered
as a plain per-node DI singleton (`AddSiteEventLogging`), so it records to a local
SQLite file on **every** node, including the standby. That wastes storage on the
standby but does **not** cause the query-returns-nothing symptom the finding
describes, because the query singleton always reads the *active* node's (populated)
database. Gating the writer to the active node would be a `ScadaLink.Host` wiring
change, outside this module's scope, and is a minor optimisation rather than a
correctness defect.

Re-triaged from High to Low and closed as **Won't Fix**: the High-severity
correctness claim does not hold. Any residual cleanup (gate the standby-node writer;
the comment needs no change) can be raised as a fresh Low finding against
`ScadaLink.Host` if desired.

**Resolution**

Won't Fix — 2026-05-16 (commit `<pending>`). Re-triaged: the asserted defect (query
served by standby returning an empty log) cannot occur because the event-log query
singleton and the deployment-manager singleton share a role and so always co-locate
on the active node. No code change made; see the re-triage note above.

### SiteEventLogging-005 — `LogEventAsync` performs synchronous disk I/O on the caller's thread

| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57-99` |

**Description**

`LogEventAsync` is declared `async`-shaped (returns `Task`, `Async` suffix) but its
body is entirely synchronous: it takes `lock (_writeLock)`, runs
`cmd.ExecuteNonQuery()` (a blocking SQLite write), then returns `Task.CompletedTask`.
Callers across the codebase invoke it fire-and-forget as `_ = LogEventAsync(...)`
(e.g. `ScriptExecutionActor.cs:133`, `DataConnectionActor.cs:292`,
`ScriptActor.cs:250`) expecting it to be non-blocking. In reality the SQLite write,
and any contention on `_writeLock`, executes inline on the Akka actor thread of the
calling subsystem. Under an event burst (alarm storm, script failure loop) this
serialises actor threads on disk I/O and the global write lock, degrading the
hot-path subsystems the design intends to keep responsive.

**Recommendation**

Either make recording genuinely asynchronous (offload to a dedicated single-threaded
writer / `Channel<T>` consumer so callers truly fire-and-forget), or rename the
method to `LogEvent` and document that it blocks, so callers can decide. Given the
design's emphasis on not impacting runtime subsystems, an internal queue with a
background flush is preferable.

**Resolution**

Resolved 2026-05-16 (commit `pending`): event recording is now offloaded to a
dedicated background writer. `SiteEventLogger` owns an unbounded `Channel<T>` and a
single background consumer thread; `LogEventAsync` only validates its arguments and
enqueues, so caller threads (Akka actor threads on hot paths) never block on the
SQLite write or on contention for the write lock. The returned `Task` completes once
the event is durably persisted (so `await` callers still observe write ordering) and
faults if the write fails. `Dispose` completes the channel and drains the writer.
Regression test `LogEventAsync_DoesNotBlockCaller_WhenWriteIsSlow` (verifies the
caller returns in <500 ms while the database is held busy) plus
`LogEventAsync_TaskCompletes_AfterEventIsPersisted`.

### SiteEventLogging-006 — Missing indexes for severity and keyword-search query paths

| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:50-52`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:65-81` |

**Description**

`InitializeSchema` creates indexes on `timestamp`, `event_type`, and `instance_id`.
The query service also filters on `severity` (`severity = $severity`) and performs
`message LIKE '%...%'` / `source LIKE '%...%'` keyword search. `severity` has no
index, and a leading-wildcard `LIKE` cannot use a normal index at all. With up to a
1 GB database and a 500-row page size, severity-filtered and keyword queries do full
table scans on every page. The design explicitly lists keyword search as a supported,
expected query type.

**Recommendation**

Add an index on `severity` (or a composite index aligned with common filter
combinations such as `(event_type, severity, id)`). For keyword search, consider an
FTS5 virtual table over `message` and `source`, or accept the scan but document the
cost.

**Resolution**

Resolved 2026-05-16 (commit `pending`): `InitializeSchema` now creates
`idx_events_severity` on `site_events(severity)`, so severity-filtered queries use an
index instead of full-scanning. The leading-wildcard `LIKE` keyword search cannot use
a B-tree index by construction; rather than adding an FTS5 table, that scan is
accepted and now documented with an inline comment next to the index DDL. Regression
tests `Schema_HasIndexOnSeverity` and `SeverityFilteredQuery_UsesIndex_NotFullScan`
(the latter asserts the SQLite query plan uses `idx_events_severity` and does not
`SCAN site_events`).

### SiteEventLogging-007 — `ISiteEventLogger` consumers downcast to the concrete type

| | |
|--|--|
| Severity | Medium (partially re-triaged 2026-05-16 — see Re-triage note) |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:21-30`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:20-28`, `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:10-23` |

**Description**

Both `EventLogPurgeService` and `EventLogQueryService` took `ISiteEventLogger` via
DI and immediately downcast it: `_eventLogger = (SiteEventLogger)eventLogger;`. This
made the registration fragile — any `ISiteEventLogger` that is not exactly
`SiteEventLogger` (a test double, a decorator) caused an `InvalidCastException` at
construction — and defeated the purpose of the interface abstraction.

**Re-triage note (2026-05-16)**

The finding as originally written also claimed the services "access the `internal
SqliteConnection Connection` property to run arbitrary SQL" and called itself "the
root cause of the unsynchronised connection sharing in SiteEventLogging-003". That
part is stale: the resolution of SiteEventLogging-003 had already removed the
`internal Connection` property and replaced it with lock-guarded `WithConnection`
overloads. At the time this finding was actioned, the only remaining defect was the
concrete-type downcast itself. Severity stays Medium; the description is corrected to
the downcast-only scope.

**Recommendation**

Have the purge and query services depend on the concrete `SiteEventLogger` directly
(it is the type that owns the lock-guarded `WithConnection`), and register the
concrete type in DI with the interface forwarded to the same singleton. Remove the
fragile downcasts.

**Resolution**

Resolved 2026-05-16 (commit `pending`): `AddSiteEventLogging` now registers
`SiteEventLogger` as a concrete singleton and forwards `ISiteEventLogger` to that
same instance. `EventLogPurgeService` and `EventLogQueryService` take a
`SiteEventLogger` constructor parameter directly, eliminating the
`(SiteEventLogger)eventLogger` downcast and its `InvalidCastException` risk. All
three services still share one connection/lock. Regression tests
`AddSiteEventLogging_ResolvesAllServices_SharingOneRecorderInstance` and
`PurgeAndQueryServices_AcceptConcreteRecorder_WithoutDowncast`.

### SiteEventLogging-008 — Event-recording write failures are silently swallowed

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:92-95` |

**Description**

If `ExecuteNonQuery` throws (disk full, database locked, file corruption), the
exception is caught, written to `ILogger`, and discarded; `LogEventAsync` still
returns `Task.CompletedTask` as if successful. Callers fire-and-forget the result so
they cannot detect failure. The event log is the site's diagnostic audit trail; a
sustained write failure (for example a locked-database storm caused by the
unsynchronised purge in SiteEventLogging-003) means events vanish with no signal to
operators except a local log line that nobody is watching. There is no failure
counter, no health-metric hook, and no retry.

**Recommendation**

Expose a failure signal: increment a counter that the Health Monitoring component
can surface (the design notes script/alarm error rates are derived from the event
log — a logging outage should be visible). At minimum, escalate repeated failures to
a Warning/Error health metric rather than only a local log line.

**Resolution**

Resolved 2026-05-16 (commit `pending`): write failures are no longer swallowed. The
background writer (introduced for SiteEventLogging-005) now, on an `INSERT` failure,
(a) increments a new `Interlocked`-guarded counter exposed as the public
`SiteEventLogger.FailedWriteCount` property — which Health Monitoring can poll to
detect a logging outage — and (b) faults the `Task` returned by `LogEventAsync` with
the exception instead of returning `Task.CompletedTask`. The error is still logged.
Regression test `LogEventAsync_FaultsTask_AndCountsFailure_OnWriteError`.

### SiteEventLogging-009 — XML doc on `LogEventAsync` claims asynchronous behaviour

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:8-10`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57` |

**Description**

The interface XML doc states "Record an event asynchronously." and the method is
named `LogEventAsync`, but the implementation is fully synchronous (see
SiteEventLogging-005). The documentation and naming are misleading: a reader will
reasonably assume the write is offloaded and the caller's thread is not blocked,
which is false. The `details` parameter doc says "Optional JSON details" but nothing
validates or requires JSON, so callers may pass arbitrary text.

**Re-triage note (2026-05-16)**

The finding's central premise is now stale. SiteEventLogging-005 was resolved by
introducing a dedicated background writer: `LogEventAsync` enqueues onto an unbounded
`Channel<T>` and returns without blocking, so the method is now *genuinely*
asynchronous and the name `LogEventAsync` is accurate. No rename is warranted. The
only residual documentation defect was the imprecise XML doc (the one-line summary
did not describe the enqueue/await semantics) and the `details` doc claiming "JSON"
when nothing enforces it. Scope corrected to those doc fixes only.

**Recommendation**

Align the name, signature, and documentation with the actual behaviour — either make
the method genuinely asynchronous or rename to `LogEvent` and correct the doc.
Clarify that `details` is free-form text unless JSON is actually enforced.

**Resolution**

Resolved 2026-05-16 (commit `pending`): the `ISiteEventLogger.LogEventAsync` XML doc
now describes the actual behaviour — the call enqueues onto a background writer and
returns without blocking, and the returned `Task` completes on durable persistence
(or faults on write failure). The `details` parameter doc was corrected to "Optional
free-form detail text ... JSON is conventional but not validated or enforced". No
rename was needed: post-SiteEventLogging-005 the method is genuinely asynchronous, so
`LogEventAsync` is accurate (see Re-triage note). Documentation-only change; no
regression test added (behaviour is already covered by the SiteEventLogging-005
tests `LogEventAsync_DoesNotBlockCaller_WhenWriteIsSlow` and
`LogEventAsync_TaskCompletes_AfterEventIsPersisted`).

### SiteEventLogging-010 — Test coverage gaps: actor bridge, purge/write concurrency, vacuum effectiveness, query error path

| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ScadaLink.SiteEventLogging.Tests/` |

**Description**

The test suite covers recording, query filtering/pagination, and basic purge, but
several critical behaviours are untested:

- `EventLogHandlerActor` has no test — the actor message contract
  (`EventLogQueryRequest` -> `EventLogQueryResponse`, `Sender.Tell`) is unverified.
- No test exercises purge running concurrently with active writes/queries, so the
  connection-sharing race (SiteEventLogging-003) is invisible to CI.
- `PurgeByStorageCap` is only tested with `MaxStorageMb = 0`, which hides the
  no-op-vacuum / over-delete bug (SiteEventLogging-001, -002). No test asserts the
  file shrinks or that only oldest events are removed under a realistic cap.
- `EventLogQueryService.ExecuteQuery`'s catch block (`Success: false`,
  `ErrorMessage`) has no test.
- `SiteEventLogger.Dispose` semantics (logging after dispose returns
  `Task.CompletedTask`) and re-entrant dispose are untested.

**Recommendation**

Add tests for the actor bridge, a concurrency stress test (purge + write + query in
parallel), a realistic non-zero-cap purge test asserting size reduction and
oldest-first deletion, and a query-error-path test (e.g. corrupt/closed connection).

**Resolution**

Resolved 2026-05-16 (commit `pending`): the remaining coverage gaps are now filled.
`EventLogHandlerActorTests` (using `Akka.TestKit.Xunit2`) exercises the actor message
contract — `EventLogQueryRequest` -> `EventLogQueryResponse` via `Sender.Tell`, for
both success and error responses. `EventLogCoverageTests` covers the previously
untested `EventLogQueryService.ExecuteQuery` catch block
(`ExecuteQuery_ReturnsFailureResponse_WhenDatabaseUnavailable`) and the recorder's
disposed-state semantics (`LogEventAsync_AfterDispose_CompletesWithoutThrowing`,
`Dispose_IsIdempotent`). The purge/write concurrency, realistic-cap, and
oldest-first behaviours were already covered by the tests added when
SiteEventLogging-001/-002/-003 were resolved
(`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection`,
`PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable`,
`PurgeByStorageCap_RemovesOldestEventsFirst`).

### SiteEventLogging-011 — Stale "Phase 4+" placeholder in `ServiceCollectionExtensions`

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:18-22` |

**Description**

`AddSiteEventLoggingActors` is an empty method with a comment "Placeholder for Akka
actor registration (Phase 4+)". The actor (`EventLogHandlerActor`) is in fact already
implemented and is registered directly in `AkkaHostedService.cs:313-336`, not through
this method. The placeholder is dead code: it is either never called or called with
no effect, and the comment is stale. A reader looking for where the event-log actor
is wired up will be misdirected.

**Recommendation**

Either implement the actor registration here and have `AkkaHostedService` call it
(centralising the wiring), or delete `AddSiteEventLoggingActors` entirely and remove
the misleading comment.

**Resolution**

Resolved 2026-05-16 (commit `pending`): confirmed dead code — a repo-wide search
found zero callers of `AddSiteEventLoggingActors`, and `EventLogHandlerActor` is in
fact wired up directly in `ScadaLink.Host/Actors/AkkaHostedService.cs` as a cluster
singleton (it must be created inside the `ActorSystem` with a resolved
`IEventLogQueryService`, which a `IServiceCollection` extension cannot do). The empty
placeholder method and its stale "Phase 4+" comment were deleted, and a short
explanatory note added to `AddSiteEventLogging` pointing readers to where the actor
is actually registered. Documentation/dead-code change only; no regression test was
added — the change is a method removal verified by the compiler (no callers) and the
full module suite still passing.

### SiteEventLogging-012 — Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:160-166,193-197` |

**Description**

`LogEventAsync` returns a `Task` that, per the interface XML doc (corrected under
SiteEventLogging-009), "completes once the event is durably persisted and faults if
the write fails, so callers that `await` it observe success or failure." Two paths
break that contract by signalling **success** for an event that was never written:

1. In `LogEventAsync`, if `_writeQueue.Writer.TryWrite(pending)` fails (the channel
   has been completed because the logger was disposed), the code calls
   `pending.Completion.TrySetResult()` — completing the `Task` successfully — even
   though the comment immediately above acknowledges "there is nowhere to persist the
   event."
2. In `ProcessWriteQueueAsync`, `WithConnection` returns `false` when the logger has
   been disposed mid-drain. The code does not inspect the returned `written` flag and
   unconditionally calls `pending.Completion.TrySetResult()`, again reporting success
   for an event the comment admits "simply cannot be persisted."

The event log is the site's diagnostic audit trail. A caller that `await`s
`LogEventAsync` to confirm a critical event (deployment applied, alarm activated) was
recorded will observe a *successful* completion for an event that was silently
dropped. This is the same class of defect SiteEventLogging-008 fixed for write
*errors* — but the disposed-drop path was left reporting false success. The window
is the disposal/shutdown interval, during which shutdown-related events (graceful
singleton handover, instance disable) are exactly the events most likely to be
enqueued and lost.

**Recommendation**

For both paths, fault the `Task` (or complete it with a sentinel failure) instead of
`TrySetResult()` — e.g. `pending.Completion.TrySetException(new ObjectDisposedException(...))`
— so an `await`-ing caller can distinguish a dropped event from a persisted one.
Inspect the `written` flag returned by `WithConnection` in `ProcessWriteQueueAsync`
and only call `TrySetResult()` when `written` is `true`. Update the XML doc if a
deliberate "drop silently on shutdown" semantics is chosen instead.

**Resolution**

Resolved 2026-05-17 (commit `<pending>`): both disposed-drop paths now fault the
returned `Task` with `ObjectDisposedException` instead of `TrySetResult()` —
`LogEventAsync` on a failed `TryWrite`, and `ProcessWriteQueueAsync` now inspects the
`written` flag from `WithConnection` and only calls `TrySetResult()` when the row was
actually persisted. An `await`-ing caller can now distinguish a dropped audit event
from a persisted one. Regression tests
`LogEventAsync_AfterDispose_FaultsTask_NotReportsSuccess` and
`LogEventAsync_EnqueuedThenDisposed_FaultsTask_WhenWriteCannotComplete` (the prior
`LogEventAsync_AfterDispose_CompletesWithoutThrowing` test asserted the now-incorrect
silent-success behaviour and was replaced).

### SiteEventLogging-013 — Keyword search does not escape SQL `LIKE` wildcards in user input

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:79-83` |

**Description**

The keyword-search filter builds the `LIKE` pattern as `$"%{request.KeywordFilter}%"`
and binds it as a parameter. Parameterisation correctly prevents SQL injection, but
it does **not** neutralise the `LIKE` metacharacters `%` and `_` inside the
user-supplied keyword. A search for a literal `_` (common in event sources and
identifiers such as `store_and_forward`, `PLC_1`, or instance IDs) is interpreted as
"match any single character", and a `%` matches any run of characters. The design
calls keyword search "free-text search on message and source fields ... Useful for
finding events by script name, alarm name, or error message" — users will reasonably
expect a literal substring match, so a query for `store_and_forward` silently returns
events containing `storeXandYforward` and similar false positives. There is no way
for the caller to search for a literal underscore or percent.

**Recommendation**

Escape `%`, `_`, and the escape character itself in `request.KeywordFilter` before
wrapping it in `%...%`, and append an `ESCAPE` clause to the `LIKE` expression
(e.g. `... LIKE $keyword ESCAPE '\'`). Alternatively document that the keyword field
accepts `LIKE` wildcard syntax, but a literal-substring match is the behaviour the
design implies.

**Resolution**

Resolved 2026-05-17 (commit `<pending>`): `EventLogQueryService` now escapes `\`,
`%`, and `_` in `request.KeywordFilter` via a new `EscapeLikePattern` helper before
wrapping it in `%...%`, and the `LIKE` clauses carry an explicit `ESCAPE '\'` so the
keyword is matched as a literal substring as the design intends. Regression tests
`Query_KeywordSearch_TreatsUnderscoreAsLiteral_NotWildcard`,
`Query_KeywordSearch_TreatsPercentAsLiteral_NotWildcard`, and
`Query_KeywordSearch_StillMatchesPlainSubstring`.

### SiteEventLogging-014 — Initial purge runs synchronously on the host startup thread

| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Won't Fix |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:34-48` |

**Description**

`EventLogPurgeService.ExecuteAsync` calls `RunPurge()` (a fully synchronous method
that runs `PurgeByRetention` and `PurgeByStorageCap`) *before* the first `await`
(`await timer.WaitForNextTickAsync(...)`). A `BackgroundService`'s `ExecuteAsync` is
invoked from `StartAsync`, and the host's startup pipeline does not proceed past a
`BackgroundService` until its `ExecuteAsync` yields at the first real `await`. Because
`RunPurge()` precedes any `await`, the entire initial purge — including a cap-purge
that deletes rows in 1000-row batches and runs `PRAGMA incremental_vacuum` until a
near-1 GB database is back under the cap — executes inline on the startup thread,
blocking host startup (and therefore the `/health/ready` gate) for as long as the
purge takes. On a site that has accumulated a large log this can be a multi-second
stall during every node start/failover. The class doc states the service "runs on a
background thread and does not block event recording" — the startup-thread block is
inconsistent with that intent.

**Recommendation**

Yield before the initial purge so it runs on the background scheduler rather than the
startup thread — e.g. `await Task.Yield();` as the first statement of `ExecuteAsync`,
or move the initial `RunPurge()` to after the first `await timer.WaitForNextTickAsync`
(accepting a one-interval delay), or offload it with `await Task.Run(RunPurge, stoppingToken)`.

**Re-triage note (2026-05-17)**

The finding's central premise — "the host's startup pipeline does not proceed past a
`BackgroundService` until its `ExecuteAsync` yields at the first real `await`", so the
synchronous `RunPurge()` prelude blocks the startup thread — is **incorrect for the
.NET version this project targets**. The module targets `net10.0`. Since .NET 8,
`BackgroundService.StartAsync` no longer runs `ExecuteAsync` inline on the calling
(host startup) thread: `ExecuteAsync` is dispatched onto the thread pool, and
`StartAsync` returns immediately regardless of how long the synchronous prelude runs.

This was verified empirically. A standalone `BackgroundService` whose `ExecuteAsync`
prelude does a 1.5 s synchronous `Thread.Sleep` before its first `await` showed
`StartAsync` returning in **0 ms**, with the prelude running on a *different* thread
pool thread than the caller. Applied to `EventLogPurgeService`, `StartAsync` returns
promptly and the initial `RunPurge()` executes on the background scheduler — host
startup and the `/health/ready` gate are not blocked. There is therefore no defect
to fix.

**Resolution**

Won't Fix — 2026-05-17 (commit `<pending>`). Re-triaged: the asserted startup-thread
block cannot occur on .NET 8+ (this module targets `net10.0`), where
`BackgroundService` dispatches `ExecuteAsync` to the thread pool rather than running
its synchronous prelude on the host startup thread — verified empirically (see the
re-triage note). No code change made. A verification test
`StartAsync_DoesNotBlock_OnTheInitialPurge` was added to pin this behaviour
(asserts `StartAsync` returns in under 1 s and the initial purge still runs on the
background scheduler).

### SiteEventLogging-015 — Background write queue is unbounded; can grow without limit under sustained writer slowness

| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:58-63` |

**Resolution (2026-05-28):** Switched `SiteEventLogger._writeQueue` to
`Channel.CreateBounded<PendingEvent>` with `FullMode = BoundedChannelFullMode.DropOldest`.
Default capacity 10,000 via new `SiteEventLogOptions.WriteQueueCapacity`. The
`itemDropped` callback faults the dropped event's Task with
`InvalidOperationException` and increments `FailedWriteCount` so both awaiting
callers and Health Monitoring observe drops. `DropOldest` preserves the
SiteEventLogging-005 "callers never block" guarantee. Test:
`LogEventAsync_BoundedQueueDropsOldest_AndFaultsDroppedTask`.

**Description**

`SiteEventLogger` creates its background-writer feeder as
`Channel.CreateUnbounded<PendingEvent>(...)`. The writer thread funnels every write
through the shared `_writeLock` (acquired by `WithConnection`), so any condition that
makes a single iteration slow — a long-running query in `EventLogQueryService`
holding the lock, a `PurgeByStorageCap` run that takes the lock for batched
`DELETE` + `PRAGMA incremental_vacuum`, a disk stall, or a sustained event burst
from an alarm storm / script failure loop — drives the queue arbitrarily large.
Every queued `PendingEvent` retains its `TaskCompletionSource` and its payload
strings, so there is no upper bound on how much memory the recorder can hold.

The sister centralized-audit component `ScadaLink.AuditLog/Site/SqliteAuditWriter.cs`
addresses the same hot-path-writer problem with
`Channel.CreateBounded<...>(new BoundedChannelOptions(_options.ChannelCapacity) { ..., FullMode = BoundedChannelFullMode.Wait })`,
giving back-pressure to producers. Site event logging picked the riskier choice for
a component that — per the design — is fed by every site subsystem (script, alarm,
deployment, DCL, store-and-forward, instance lifecycle, notification) and has both
a 30-day retention sweep and a 1 GB cap-purge competing for the same lock.

**Recommendation**

Switch to `Channel.CreateBounded<PendingEvent>(...)` with a configurable capacity
(default in the order of 10 000 — large enough to absorb a normal alarm burst,
small enough to bound memory). Pick a `FullMode` that matches policy: `Wait` for
back-pressure (callers `await` and serialise their actor thread on the queue —
defeats some of the SiteEventLogging-005 win but is safe), or `DropOldest` /
`DropWrite` with a counter (drop-and-count is closer to "best-effort audit"). Add
the dropped-event counter to `FailedWriteCount` or a sibling metric. Document the
chosen policy on `ISiteEventLogger.LogEventAsync`.

### SiteEventLogging-016 — `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps

| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:67-77`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:159`, `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:72-78` |

**Resolution** — `EventLogQueryService.ExecuteQuery` now calls `.ToUniversalTime()` on `request.From`/`request.To` before `ToString("o")`, so the produced ISO 8601 string always ends in `+00:00` and lexicographically matches the UTC timestamps written by `SiteEventLogger`. `EventLogPurgeService.PurgeByRetention` was also made defensive with an explicit `.ToUniversalTime()` on the cutoff. A regression test (`Query_FiltersByTimeRange_HandlesNonUtcOffset`) constructs a `+05:00` `DateTimeOffset` and asserts the matching UTC-stored events are returned and out-of-range ones are excluded.

**Description**

Event rows are persisted with `timestamp` = `DateTimeOffset.UtcNow.ToString("o")`,
which always emits the round-trip ISO 8601 form ending in the literal offset
`+00:00` (e.g. `2026-05-28T12:34:56.7890123+00:00`). The query path filters by
range using a direct string comparison:

```
whereClauses.Add("timestamp >= $from");
parameters.Add(new SqliteParameter("$from", request.From.Value.ToString("o")));
```

`request.From` is a `DateTimeOffset?` and `ToString("o")` preserves whatever offset
the caller passed in. If a central client passes a non-UTC `DateTimeOffset` — for
example the result of `DateTimeOffset.Now` in a `UTC+05:00` timezone — the produced
string is `"2026-05-28T17:34:56.0000000+05:00"`, which is lexicographically *greater*
than the equivalent UTC instant string `"2026-05-28T12:34:56.0000000+00:00"`. The
comparison `timestamp >= $from` is then evaluated as a byte-by-byte string compare
(SQLite default `BINARY` collation), so the query either spuriously excludes events
that genuinely occurred in the range, or spuriously includes events from a wholly
different hour. The same defect applies to `To`. The retention purge does
`DateTimeOffset.UtcNow.AddDays(-N).ToString("o")` (UTC) so it is safe; only the
central query path is vulnerable.

The design explicitly states "All timestamps are UTC throughout the system" but the
boundary between a central `DateTimeOffset` and the SQLite store is not enforced.
A central UI rendered in a non-UTC timezone is the most likely trigger, and the
defect silently corrupts every query that filters by time range — exactly the
filter most likely to be set on a "show me what happened around the failover" query.

**Recommendation**

Normalise `From` / `To` to UTC before serialising:
`request.From.Value.ToUniversalTime().ToString("o")` (or
`.UtcDateTime.ToString("o")`), so the produced offset is always `+00:00`. Add a
regression test that filters with a `DateTimeOffset` carrying a non-zero offset and
asserts the matching events are returned. Optionally also store timestamps as
Unix-epoch `INTEGER` and let SQLite compare numerically, eliminating the
lexicographic-comparison hazard structurally.

### SiteEventLogging-017 — Central client's `PageSize` is unbounded; defeats the "configurable page size" design rationale

| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:55`, `src/ScadaLink.Commons/Messages/RemoteQuery/EventLogQueryRequest.cs:18` |

**Resolution (2026-05-28):** `EventLogQueryService.ExecuteQuery` now clamps
`pageSize` to new `SiteEventLogOptions.MaxQueryPageSize` (default 500, matching
existing `QueryPageSize` default) before issuing the SQL. Silent clamp rather
than reject so misconfigured clients still get a usable response. Test:
`Query_PageSize_IsClampedToMaxQueryPageSize` (asserts a request of 100,000 is
clamped down with `HasMore = true`).

**Description**

`EventLogQueryService.ExecuteQuery` resolves the effective page size as
`var pageSize = request.PageSize > 0 ? request.PageSize : _options.QueryPageSize;`
and uses it directly as the SQL `LIMIT $limit` (passing `pageSize + 1` to detect
"has more"). There is no upper bound. A central client — buggy or hostile — can
send `PageSize = int.MaxValue`, in which case the query attempts to materialise the
entire (up to 1 GB) event log into a single `List<EventLogEntry>` while holding the
shared write lock. This:

- Builds a worst-case ~1 GB managed allocation that, depending on Akka.NET cluster
  message serialisation limits, will then be serialised into an
  `EventLogQueryResponse` and pushed over the ClusterClient pipe.
- Blocks all writes (purge, recorder hot path) for the duration of the scan
  because the read holds `_writeLock`.
- Stalls the singleton `EventLogHandlerActor`, also blocking subsequent legitimate
  queries.

The design explicitly justifies pagination as preventing exactly this — "Results
are paginated with a configurable page size (default: 500 events) ... This prevents
broad queries from overwhelming the communication channel." The code honours the
*default* but does not enforce an *upper bound* on a client-supplied override.

**Recommendation**

Clamp `pageSize` to a configurable maximum (e.g. `SiteEventLogOptions.MaxQueryPageSize`,
default 5000) before using it. Also bound `KeywordFilter.Length` (e.g. 256 chars) —
a leading-wildcard `LIKE` of an unbounded pattern is itself an expensive operation
that runs under the same lock. Add a `Success: false, ErrorMessage: "PageSize
exceeds maximum"` reject path so a misbehaving central is told why its query is
refused.

### SiteEventLogging-018 — `FailedWriteCount` is exposed but never consumed by Health Monitoring

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:67-71,225-226` |

**Description**

`SiteEventLogger.FailedWriteCount` was added under SiteEventLogging-008 with the
XML doc statement "Surfaced so Health Monitoring can detect a logging outage
instead of relying on a local log line nobody is watching." The implementation is
correct (`Interlocked.Increment` on write failure, `Interlocked.Read` getter), but
a repo-wide search shows **no** caller anywhere in `src/` reads the property —
neither `ScadaLink.HealthMonitoring`, the central health collector, nor the host's
`/health` endpoint. The metric is dead-letter: a logging outage still goes
unnoticed in production, contradicting the original finding's resolution claim.

The property is also exposed only on the concrete `SiteEventLogger`, not on
`ISiteEventLogger`, so even if Health Monitoring were wired up it would have to
take a concrete-type dependency (`internal Connection` removed, but
`FailedWriteCount` remained concrete-only).

**Recommendation**

Either (a) wire `FailedWriteCount` into the existing Health Monitoring metric
pipeline (e.g. publish it alongside other 30-second-interval site metrics, and
promote a sustained non-zero value to a Warning), and add it to `ISiteEventLogger`
so the consumer doesn't downcast; or (b) acknowledge the metric is unobserved by
softening the XML doc to "Available for future Health Monitoring integration" and
file a tracking item for the wiring. The current doc claim is misleading.

### SiteEventLogging-019 — `EventLogPurgeService` runs on every host node; design says "active node"

| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:57-95`, `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:30-39`, `docs/requirements/Component-SiteEventLogging.md:45` |

**Description**

`AddSiteEventLogging` calls `services.AddHostedService<EventLogPurgeService>()`,
which registers the purge `BackgroundService` per host. On a 2-node site cluster
both `node-a` and `node-b` start the service independently, so each runs its own
30-day retention purge and 1 GB cap purge against its own local
`site_events.db`. The design states only "A daily background job runs on the
active node and deletes all events older than 30 days." (Component-SiteEventLogging,
Storage section). In practice the standby node receives no writes, so its purge
finds nothing to delete and is harmless — but the implementation does not match the
documented "active node" gating, and the resolution note on SiteEventLogging-004
already flagged that the *writer* runs on the standby too. The purge has the same
shape.

Aligning to the design is also a defence against a future change that does write
to the standby (e.g. local heartbeats), and removes the per-node wake-ups that
contribute to `Microsoft.Extensions.Hosting` shutdown latency.

**Recommendation**

Either (a) gate the purge service on "this node is the active member of `siteRole`"
(check the cluster singleton ownership before each `RunPurge()`, or host the
purge inside the same cluster singleton as `EventLogHandlerActor`), or (b) reword
the design doc to "the purge runs on every node against its own local database;
on the standby it is a no-op". Pick one; the current mismatch is a doc-vs-code
defect.

**Resolution (2026-05-28):** Took option (a) at the loop level — registration
stays unchanged on every host. Introduced a `SiteEventLogActiveNodeCheck`
delegate that `EventLogPurgeService` consults at the top of every
`RunPurge()` tick; standby returns early with a debug log. The DI factory
resolves the delegate from the container so the Host can register the real
check on a site node, and a null/unregistered delegate falls back to the
prior "always run" behaviour (backward compatible for non-clustered hosts and
existing tests). Defensive try/catch around the check defaults to "run" so a
transient cluster-state read failure cannot stop the purge loop. Added tests
`RunPurge_OnStandbyNode_SkipsAllWork`,
`RunPurge_OnActiveNode_RunsTheRetentionPurge`, and
`RunPurge_WithNullCheck_FallsBackToRunning`. Wiring the real check on the
Host's site-role branch is left for the Host's review. Tests green (50/50 in
SiteEventLogging.Tests).

### SiteEventLogging-020 — `severity` and `eventType` are unvalidated free-form strings; doc enumerates a set that is not enforced

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:144-156`, `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:14-15` |

**Resolution (2026-05-28):** `LogEventAsync` now validates `severity` against
the closed set `{Info, Warning, Error}` (case-sensitive, matches SQLite
BINARY collation used by the query filter) and throws `ArgumentException`
naming the offending value. `eventType` left intentionally free-form (design
enumerates an open category set). Tests: `LogEventAsync_ThrowsOnUnknownSeverity`
(5 cases) and `LogEventAsync_AcceptsAllDocumentedSeverities` (3 cases).

**Description**

`LogEventAsync` validates `eventType` and `severity` only for non-empty/non-whitespace.
The XML doc enumerates the allowed values: `eventType` ∈ {script, alarm,
deployment, connection, store_and_forward, instance_lifecycle}, `severity` ∈
{Info, Warning, Error}. Nothing in the code enforces either set. Any caller can
pass `"SCRIPT"`, `"Script"`, `"warn"`, `"ERR"`, or a typo and the row is inserted
verbatim. Two follow-on consequences:

1. The `EventLogQueryService.Severity` filter is `severity = $severity` (exact
   match, case-sensitive by SQLite default `BINARY` collation). A row stored as
   `"error"` will not be returned for a query filtering on `"Error"`. The design
   lists severity as a first-class filter and the central UI will reasonably
   normalise to one casing — every row stored with a different casing is silently
   invisible to that filter.
2. The `Events Logged` table in the design implicitly relies on a stable
   `event_type` enumeration to drive UI grouping; a typo'd `event_type` slips in
   silently and is hard to detect later.

**Recommendation**

Validate `eventType` and `severity` against a known set (or accept `enum`s on the
interface, converting to canonical string at the call site). Reject unknown values
with `ArgumentException` and log a single-shot warning during construction if a
deployment is found to be using an unexpected value. Alternatively, normalise
casing (`severity = severity.ToLowerInvariant()`) so the query filter is
case-insensitive. Update the XML doc to match the enforced contract.

### SiteEventLogging-021 — `DateTimeOffset.Parse` uses the current culture; can throw on non-default locales

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:138` |

**Resolution (2026-05-28):** `EventLogQueryService.ExecuteQuery` now parses the
stored ISO 8601 timestamp with `CultureInfo.InvariantCulture` plus
`DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal`, so the
result is locale-independent and guaranteed UTC. No new test was added — the
recorder still writes via `DateTimeOffset.UtcNow.ToString("o")` which is itself
invariant-safe in practice, and the existing query/time-range tests continue
to pass.

**Description**

`ExecuteQuery` materialises rows via
`DateTimeOffset.Parse(reader.GetString(1))`. `DateTimeOffset.Parse(string)` uses
`CultureInfo.CurrentCulture` and `DateTimeStyles.None`. The stored format is ISO
8601 round-trip (`"o"`), which is *usually* parseable in any culture — but a
production node running with a non-default culture (e.g. Turkish "tr-TR", which
has historically broken case-insensitive ASCII comparisons via the
"Turkish-I" issue, or any culture that overrides the date/time separators) can
parse incorrectly or throw `FormatException`. The exception is caught by the outer
`try`, so the entire query is converted to a `Success: false` response — but the
failure mode is silent and culture-dependent.

The recorder side stores via `DateTimeOffset.UtcNow.ToString("o")`, which is also
culture-sensitive in the same way; on a hostile-culture node, the round-trip
between insert and query is not guaranteed to be lossless without explicit
culture pinning.

**Recommendation**

Parse with explicit invariant culture and round-trip style:
`DateTimeOffset.Parse(reader.GetString(1), CultureInfo.InvariantCulture,
DateTimeStyles.RoundtripKind)` (and the same for the `ToString("o", InvariantCulture)`
emitters in `SiteEventLogger.LogEventAsync` and `EventLogPurgeService.PurgeByRetention`).
Alternatively switch the schema to store `timestamp` as Unix-epoch `INTEGER` and
avoid all string-parsing.

### SiteEventLogging-022 — `Cache=Shared` is redundant for a single-connection logger

| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:52` |

**Description**

The connection string is built as
`$"Data Source={options.Value.DatabasePath};Cache=Shared"`. SQLite's
shared-cache mode is a *cross-connection* optimisation: it lets multiple
`SqliteConnection`s in the same process share an in-process page cache. This
logger owns exactly one `SqliteConnection` and serialises all access through
`_writeLock`, so `Cache=Shared` cannot share with anything — the mode is dormant.
At best it is dead configuration; at worst it adds (very small) per-statement
lock overhead inside SQLite. The sister `SqliteAuditWriter` carries the same
unused option, so the smell is a copy-and-paste pattern.

Shared-cache mode also subtly changes the semantics of `PRAGMA busy_timeout` and
`PRAGMA locking_mode`, so leaving it on while *not* using it is a small future-foot
gun if anyone later opens a second connection to the same file from another
component on the same host (e.g. a tooling read-only viewer).

**Recommendation**

Drop `Cache=Shared` from the connection string — the logger is single-connection
and gains nothing from it. If a future need to share the DB across connections in
the same process arises, reintroduce it deliberately together with the busy_timeout
and locking_mode review that should accompany it.

### SiteEventLogging-023 — Concurrent-stress test uses a non-volatile `stop` flag

| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.SiteEventLogging.Tests/EventLogPurgeServiceTests.cs:282-308` |

**Description**

`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` uses a plain `bool stop = false;`
that the main test thread mutates after the purge task completes
(`stop = true;`) while four background writer tasks are spin-checking `while (!stop)`.
The flag is not declared `volatile`, not wrapped in `Volatile.Read/Volatile.Write`,
and not behind a memory barrier. On a release build with a relaxed memory model
the writer threads are permitted to cache the `stop = false` read indefinitely,
which means in theory the test can hang past xUnit's per-test timeout instead of
asserting `Empty(exceptions)`. The test relies on observed JIT/runtime behaviour
that today happens to refresh the field across the `await _eventLogger.LogEventAsync`
boundary, but that is an implementation detail rather than a contract.

The test is a regression test for SiteEventLogging-003; a flaky / hang-prone
version of it can mask the very behaviour it is meant to pin.

**Recommendation**

Use a `CancellationTokenSource` (`while (!cts.IsCancellationRequested)`), or change
`stop` to a `volatile bool`, or use `Interlocked.Exchange` / `Volatile.Read`.
`CancellationTokenSource` is the canonical .NET pattern and also lets the test
cooperate with xUnit's `Task.WhenAll` timeout.