Files

T

Joseph Doherty d190345ef0 test(coverage): close Theme 8 — 13 test-coverage findings, +35 tests

13 well-bounded test-coverage gaps closed across 11 test projects.
Net +35 regression tests; no production code changes except the
SiteEventLogger src reference unchanged (W3 redacted only test code).

Test additions:
- CLI-022: CommandTreeTests pinned-count assertion bumped 14→16 and
  3 InlineData rows added for the audit + bundle command groups.
- Commons-020: new TransportRecordsTests covers BundleManifest /
  ExportSelection / ImportPreview / ImportResolution / ImportResult —
  ctor + System.Text.Json round-trip + record-equality (14 tests).
- CD-024: SPLIT-RANGE failure-continuation now under
  EnsureLookahead_SecondSplitThrows_LoopAborts_FirstBoundaryStillCommitted
  (Skippable MS-SQL fixture); production-shape rowversion delete
  asserted by DeleteDeploymentRecord_CurrentRowVersion_StubAttachPath_DeleteSucceeds.
- CentralUI-033: new QueryStringDrillInTests with 4 bUnit cases for
  Transport + SiteCalls drill-in / query-string handling.
- DM-024: probe actors (ReconcileProbeActor, SerializationProbeActor,
  ArtifactProbeActor) refactored from static fields to per-test instances
  (Interlocked on counter) — all 31 callers updated; no production
  changes required.
- HM-022: real-time PeriodicTimer test flake fixed by replacing
  fixed-budget Task.Delay with a RunLoopUntil poll-until-condition
  helper (5s/25ms). Production loop untouched.
- InboundAPI-023: new EndpointExtensionsTests covers the
  POST /api/{methodName} composition wiring via TestServer (7 cases:
  happy path, missing key 401, unknown method 403, invalid JSON 400,
  missing param 400, script-throws 500 sanitised, AuditActorItemKey
  stash invariant).
- MgmtSvc-021: 6 new ManagementActorTests cover the Transport bundle
  handlers (role gate for Export/Preview/Import, unknown-name
  ManagementCommandException, blocker-rejection, dedupe last-write-wins).
- SCA-006: SiteCallQueryRequest_StuckOnly_CursorAtNonStuckBoundary_SkipsToNextStuckRow
  pins the missing boundary case.
- SEL-023: stress-test `bool stop` promoted to `volatile bool` for
  cross-thread visibility under release/JIT.

Verify-only resolutions:
- NS-024: closed by NS-019 (commit ac96b83 deletion of
  NotificationDeliveryService + its test file). No edits needed.
- NotifOutbox-008: FallbackMaxRetries/FallbackRetryDelay are private
  forward-compat constants returned only when no SMTP-config row exists
  (in which case EmailNotificationDeliveryAdapter returns Permanent,
  bypassing the values entirely). Marked Resolved with note.
- Transport-010: Overwrite child-collection sync covered by the T-001/
  T-002 tests added in commit e3ca9af; per-IP throttle by
  BundleUnlockRateLimiterTests; failed-session retention by
  BundleSessionStoreTests; T-009 closed structurally via AsyncLocal.
  Marked Resolved by reference.

Build clean; all 11 affected test suites green. README regenerated:
33 open (was 46).

2026-05-28 08:21:03 -04:00

63 KiB

Raw Blame History

Code Review — SiteEventLogging

Field	Value
Module	`src/ScadaLink.SiteEventLogging`
Design doc	`docs/requirements/Component-SiteEventLogging.md`
Status	Reviewed
Last reviewed	2026-05-28
Reviewer	claude-agent
Commit reviewed	`1eb6e97`
Open findings	3

Summary

The SiteEventLogging module is small and broadly well-structured: a SQLite-backed recorder (SiteEventLogger), a query service with keyset pagination, a background purge service, and a thin Akka actor bridge. The query path is parameterised correctly (no SQL injection) and reasonably well tested. However, the storage-cap enforcement is functionally broken: PRAGMA incremental_vacuum is a no-op because auto_vacuum = INCREMENTAL is never set, so the cap-purge loop never sees the database shrink and over-deletes the entire table when triggered. There is also a genuine concurrency hazard: the purge service and query service share the single SqliteConnection owned by SiteEventLogger but bypass its _writeLock, so a purge running on the background thread can collide with a write or a query on another thread. The LogEventAsync API is synchronous despite its name and Task return, which silently blocks Akka actor threads on disk I/O. Other findings concern the cluster-singleton placement of the handler actor (which can pin to the standby node), missing indexes for common query filters, retention/cap purge not enforcing the requirement strictly, and several documentation/maintainability issues.

Re-review 2026-05-17 (commit `39d737e`)

Re-reviewed the module at commit 39d737e. All eleven prior findings remain closed (SiteEventLogging-001..003, 005..011 Resolved; 004 Won't Fix) and the resolutions hold up under inspection — the background writer, lock-guarded WithConnection, auto_vacuum = INCREMENTAL plus logical-size measurement, the severity index, and the concrete-recorder DI wiring are all present and correct at this commit. The module source is byte-identical between 39d737e and current HEAD, so this review reflects the live code. Three new findings were recorded, all low-to-medium and none regressions of prior fixes. The most notable (SiteEventLogging-012) is a correctness gap left by the SiteEventLogging-005 background-writer rework: when an event cannot be persisted because the logger has been disposed, the returned Task is completed successfully rather than faulted, so an await-ing caller is told a dropped audit event was written. The other two are minor: unescaped SQL LIKE wildcards in the keyword-search filter (SiteEventLogging-013) and a claimed initial-purge block on the host startup thread (SiteEventLogging-014 — later re-triaged to Won't Fix, the premise does not hold on .NET 8+).

Re-review 2026-05-28 (commit `1eb6e97`)

Re-reviewed the module at commit 1eb6e97. All fourteen prior findings remain closed and their resolutions hold up under inspection: the lock-guarded WithConnection overloads, the background-writer Channel<T> with disposed-mid-drain fault propagation, the auto_vacuum = INCREMENTAL schema + logical-size measurement, the severity index, the LIKE keyword-search escaping, and the concrete-recorder DI wiring are all present and correct at this commit. Nine new findings were recorded — none are regressions of prior fixes. The most notable (SiteEventLogging-016, High) is a correctness defect in the query path: timestamps are stored as ISO 8601 strings generated from DateTimeOffset.UtcNow (so they always have a +00:00 offset suffix), but the From/To filters are stringified verbatim via request.From.Value.ToString("o") without normalising to UTC, so a central client that sends a non-UTC DateTimeOffset gets a broken lexicographic comparison and either spuriously includes or excludes events. The next-most-notable findings are SiteEventLogging-015 (unbounded background write queue can grow without limit under sustained writer slowness — sister SqliteAuditWriter uses a bounded channel) and SiteEventLogging-017 (the central client's PageSize is used verbatim with no upper-bound clamp, defeating the design's "prevents broad queries from overwhelming the communication channel" rationale). The remaining findings are low-severity hygiene / documentation: an unused FailedWriteCount metric, untyped severity/event-type fields, non-invariant culture parsing, the purge service running on the standby node, the redundant Cache=Shared on a single-connection logger, and a non-volatile stop flag in a concurrency stress test.

Checklist coverage

#	Category	Examined	Notes
1	Correctness & logic bugs	☑	`incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). Re-review: dropped events report success (-012); `LIKE` wildcards unescaped in keyword search (-013).
2	Akka.NET conventions	☑	Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). `Ask` boundary is appropriate.
3	Concurrency & thread safety	☑	Shared `SqliteConnection` used by purge/query without the write lock (-003).
4	Error handling & resilience	☑	`LogEventAsync` swallows write failures silently into a log line only (-008); purge catches broadly.
5	Security	☑	Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding.
6	Performance & resource management	☑	Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). Re-review: claimed initial-purge startup-thread block (-014 — Won't Fix, premise invalid on .NET 8+).
7	Design-document adherence	☑	Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002).
8	Code organization & conventions	☑	Concrete-type downcast of `ISiteEventLogger` (-007); `internal Connection` leaks DB handle (-007).
9	Testing coverage	☑	No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010).
10	Documentation & comments	☑	`LogEventAsync` XML doc says "asynchronously" but is synchronous (-009); stale "Phase 4+" placeholder (-011).

Re-review (2026-05-28, 1eb6e97):

#	Category	Examined	Notes
1	Correctness & logic bugs	☑	`From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps (-016); `DateTimeOffset.Parse` without invariant culture is culture-sensitive (-021); severity/event-type accept any non-empty string with no schema enforcement (-020).
2	Akka.NET conventions	☑	`EventLogHandlerActor` is a simple `Receive`/`Tell` bridge with no supervision concerns of its own; no new findings.
3	Concurrency & thread safety	☑	Concurrent-write stress test uses a non-volatile `stop` flag (-023). The shared-connection lock pattern is correct post-SiteEventLogging-003.
4	Error handling & resilience	☑	`FailedWriteCount` is exposed but nothing in Health Monitoring polls it — the metric is unobserved (-018).
5	Security	☑	Queries are fully parameterised. `PageSize` and `KeywordFilter` from the central client are not bounded (-017) — a hostile or buggy central could request `int.MaxValue` rows or multi-MB `LIKE` patterns.
6	Performance & resource management	☑	Background write queue is unbounded (-015); `Cache=Shared` is redundant for a single-connection logger (-022); upper-bound on `PageSize` missing (-017).
7	Design-document adherence	☑	`EventLogPurgeService` is registered as a per-host `BackgroundService` and runs on the standby too, but the design says "the daily background job runs on the active node" (-019).
8	Code organization & conventions	☑	`FailedWriteCount` is on the concrete `SiteEventLogger`, not on `ISiteEventLogger`, so any future non-concrete consumer cannot read it (-018).
9	Testing coverage	☑	Non-volatile `stop` flag in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` (-023). No tests for `PageSize` bounds, `From`/`To` timezone handling, or unobserved `FailedWriteCount`.
10	Documentation & comments	☑	`FailedWriteCount` XML doc claims "Health Monitoring can poll" but nothing does (-018). Severity / event-type docs enumerate values that are not enforced (-020).

Findings

SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space


Severity	High
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:100-102`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:36-55`

Description

PurgeByStorageCap issues PRAGMA incremental_vacuum after each delete batch to reclaim space, then re-measures the database size via page_count * page_size. incremental_vacuum only has any effect when the database was created with auto_vacuum = INCREMENTAL. InitializeSchema never sets auto_vacuum, so the database uses the SQLite default (auto_vacuum = NONE). With NONE, incremental_vacuum is silently ignored and page_count does not decrease when rows are deleted (free pages are retained in the file). Consequently the while (currentSizeBytes > capBytes) loop never observes the size dropping. The storage-cap feature required by the design ("configurable maximum database size... oldest events are purged first") is therefore non-functional — it cannot bring the file back under the cap.

Recommendation

Set PRAGMA auto_vacuum = INCREMENTAL in InitializeSchema before any tables are created (it must be set before table creation or followed by a full VACUUM to take effect on an existing database). Alternatively, run a full VACUUM after cap-purge deletes, or measure logical data size (e.g. page_count - freelist_count times page_size) instead of relying on incremental_vacuum.

Resolution

Resolved 2026-05-16 (commit <pending>): InitializeSchema now sets PRAGMA auto_vacuum = INCREMENTAL before any table is created, and GetDatabaseSizeBytes measures logical size as (page_count - freelist_count) * page_size so reclaimed pages no longer mask the size drop. The cap-purge loop now reliably observes the database shrinking. Regression test PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable.

SiteEventLogging-002 — Storage-cap purge deletes the entire table when space is not reclaimed


Severity	High
Category	Design-document adherence
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:87-105`

Description

Because of SiteEventLogging-001 the on-disk size never shrinks after a delete batch, so currentSizeBytes stays above capBytes. The loop then keeps deleting 1000-row batches on every iteration until ExecuteNonQuery returns 0 — i.e. until the table is completely empty. The design states the cap should purge "the oldest events... first" to stay within budget, not wipe the whole log. When the cap is hit (e.g. during an alarm storm) this destroys all retained diagnostic history rather than trimming it to the budget. The unit test PurgeByStorageCap_DeletesOldestWhenOverCap masks the problem because it uses MaxStorageMb = 0, which legitimately expects an empty table, so the over-delete behaviour is never exercised against a realistic cap.

Recommendation

Fix the size measurement / vacuum (SiteEventLogging-001) so the loop terminates when the file is genuinely under the cap. Add a guard so the loop stops once currentSizeBytes has stopped decreasing across iterations, and add a test with a non-zero cap and a known oversized dataset to assert that only the oldest events are removed.

Resolution

Resolved 2026-05-16 (commit <pending>): with the size measurement fixed (SiteEventLogging-001) the cap loop terminates when the file is genuinely under the cap. An additional guard stops the loop if the on-disk size fails to decrease across an iteration, so a cap that can never be met no longer empties the whole table. Regression tests PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable (asserts the table is not emptied and the file ends under a realistic non-zero cap) and PurgeByStorageCap_RemovesOldestEventsFirst (asserts only the oldest events are removed).

SiteEventLogging-003 — Shared `SqliteConnection` used by purge and query without the write lock


Severity	High
Category	Concurrency & thread safety
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:64,90,100,110,114`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:36`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34,72`

Description

SiteEventLogger owns a single SqliteConnection and serialises its own writes via lock (_writeLock). EventLogPurgeService and EventLogQueryService both reach into _eventLogger.Connection and execute commands directly, without acquiring _writeLock. The purge runs on a BackgroundService thread (a different thread from event-recording callers and from the actor that drives the query service). A single SqliteConnection / SqliteCommand is not thread-safe; concurrent use from the purge thread and a recording thread (or query thread) can throw SqliteException/InvalidOperationException ("DataReader already open", "connection busy") or corrupt command state. The purge DELETE and the recorder INSERT racing is the most likely collision because event recording is continuous.

Recommendation

Funnel all access to the connection through a single synchronisation point: either expose lock-guarded methods on SiteEventLogger for purge/query to call, or give the purge and query services their own dedicated SqliteConnection instances (SQLite supports multiple connections to the same file; Cache=Shared plus a busy_timeout makes this safer). Do not share one SqliteConnection across threads.

Resolution

Resolved 2026-05-16 (commit <pending>): the raw internal Connection property was removed from SiteEventLogger and replaced with lock-guarded WithConnection(...) overloads that hold the existing _writeLock for the duration of the caller's delegate. EventLogPurgeService, EventLogQueryService, and LogEventAsync all now access the connection exclusively through WithConnection, so the purge thread, query thread, and recording threads are serialised on a single lock. Dispose was also brought under the lock to avoid a dispose/use race. Regression test PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection exercises purge running concurrently with multiple writer threads.

SiteEventLogging-004 — Event-log handler runs as a cluster singleton that can land on the standby node


Severity	Low
Original severity	High (re-triaged down to Low on 2026-05-16 — see Re-triage note)
Category	Design-document adherence
Status	Won't Fix
Location	`src/ScadaLink.Host/Actors/AkkaHostedService.cs:313-336`, `src/ScadaLink.SiteEventLogging/EventLogHandlerActor.cs:21-25`

Description

EventLogHandlerActor is hosted as a ClusterSingletonManager singleton with the stated intent that "queries always reach the active node". However, an Akka.NET cluster singleton is pinned to the oldest member of the role, which is not the same concept as the SCADA "active node" (the node currently running the Deployment Manager singleton / serving live traffic). The design doc is explicit: "Only the active node generates and stores events... the new active node starts logging to its own SQLite database." The event-log SQLite file is node-local and unreplicated. Nothing guarantees the event-log singleton co-locates with the active node, so a remote query can be served by the standby node and read that node's near-empty database, returning no events even though the active node has a full log. The explanatory comment in AkkaHostedService.cs asserts the opposite of what actually happens.

Recommendation

Either (a) host the query handler as a normal per-node actor and route queries to the active node explicitly (the node owning the Deployment Manager singleton), or (b) make the event-log writer follow the same singleton so the writer and the query handler are guaranteed co-located. Reconcile the design doc and the inline comment with whichever model is chosen.

Re-triage note (2026-05-16)

The finding's central claim — that a remote query "can be served by the standby node and read that node's near-empty database" — is incorrect for the query path. In AkkaHostedService.cs the event-log-handler ClusterSingletonManager and the deployment-manager ClusterSingletonManager are created with the same role (siteRole) in the same cluster. Akka.NET pins every cluster singleton of a given role to the oldest member of that role — so all same-role singletons in a cluster co-locate on one node. The "active node" in this codebase is, by definition, the node hosting the deployment-manager singleton; the event-log query singleton is therefore guaranteed to run on that same node. A ClusterClient query cannot land on the standby. The inline comment in AkkaHostedService.cs is accurate, not "the opposite of what happens".

A real but distinct concern exists: the writer (SiteEventLogger) is registered as a plain per-node DI singleton (AddSiteEventLogging), so it records to a local SQLite file on every node, including the standby. That wastes storage on the standby but does not cause the query-returns-nothing symptom the finding describes, because the query singleton always reads the active node's (populated) database. Gating the writer to the active node would be a ScadaLink.Host wiring change, outside this module's scope, and is a minor optimisation rather than a correctness defect.

Re-triaged from High to Low and closed as Won't Fix: the High-severity correctness claim does not hold. Any residual cleanup (gate the standby-node writer; the comment needs no change) can be raised as a fresh Low finding against ScadaLink.Host if desired.

Resolution

Won't Fix — 2026-05-16 (commit <pending>). Re-triaged: the asserted defect (query served by standby returning an empty log) cannot occur because the event-log query singleton and the deployment-manager singleton share a role and so always co-locate on the active node. No code change made; see the re-triage note above.

SiteEventLogging-005 — `LogEventAsync` performs synchronous disk I/O on the caller's thread


Severity	Medium
Category	Performance & resource management
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57-99`

Description

LogEventAsync is declared async-shaped (returns Task, Async suffix) but its body is entirely synchronous: it takes lock (_writeLock), runs cmd.ExecuteNonQuery() (a blocking SQLite write), then returns Task.CompletedTask. Callers across the codebase invoke it fire-and-forget as _ = LogEventAsync(...) (e.g. ScriptExecutionActor.cs:133, DataConnectionActor.cs:292, ScriptActor.cs:250) expecting it to be non-blocking. In reality the SQLite write, and any contention on _writeLock, executes inline on the Akka actor thread of the calling subsystem. Under an event burst (alarm storm, script failure loop) this serialises actor threads on disk I/O and the global write lock, degrading the hot-path subsystems the design intends to keep responsive.

Recommendation

Either make recording genuinely asynchronous (offload to a dedicated single-threaded writer / Channel<T> consumer so callers truly fire-and-forget), or rename the method to LogEvent and document that it blocks, so callers can decide. Given the design's emphasis on not impacting runtime subsystems, an internal queue with a background flush is preferable.

Resolution

Resolved 2026-05-16 (commit pending): event recording is now offloaded to a dedicated background writer. SiteEventLogger owns an unbounded Channel<T> and a single background consumer thread; LogEventAsync only validates its arguments and enqueues, so caller threads (Akka actor threads on hot paths) never block on the SQLite write or on contention for the write lock. The returned Task completes once the event is durably persisted (so await callers still observe write ordering) and faults if the write fails. Dispose completes the channel and drains the writer. Regression test LogEventAsync_DoesNotBlockCaller_WhenWriteIsSlow (verifies the caller returns in <500 ms while the database is held busy) plus LogEventAsync_TaskCompletes_AfterEventIsPersisted.

SiteEventLogging-006 — Missing indexes for severity and keyword-search query paths


Severity	Low
Category	Performance & resource management
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:50-52`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:65-81`

Description

InitializeSchema creates indexes on timestamp, event_type, and instance_id. The query service also filters on severity (severity = $severity) and performs message LIKE '%...%' / source LIKE '%...%' keyword search. severity has no index, and a leading-wildcard LIKE cannot use a normal index at all. With up to a 1 GB database and a 500-row page size, severity-filtered and keyword queries do full table scans on every page. The design explicitly lists keyword search as a supported, expected query type.

Recommendation

Add an index on severity (or a composite index aligned with common filter combinations such as (event_type, severity, id)). For keyword search, consider an FTS5 virtual table over message and source, or accept the scan but document the cost.

Resolution

Resolved 2026-05-16 (commit pending): InitializeSchema now creates idx_events_severity on site_events(severity), so severity-filtered queries use an index instead of full-scanning. The leading-wildcard LIKE keyword search cannot use a B-tree index by construction; rather than adding an FTS5 table, that scan is accepted and now documented with an inline comment next to the index DDL. Regression tests Schema_HasIndexOnSeverity and SeverityFilteredQuery_UsesIndex_NotFullScan (the latter asserts the SQLite query plan uses idx_events_severity and does not SCAN site_events).

SiteEventLogging-007 — `ISiteEventLogger` consumers downcast to the concrete type


Severity	Medium (partially re-triaged 2026-05-16 — see Re-triage note)
Category	Code organization & conventions
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:21-30`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:20-28`, `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:10-23`

Description

Both EventLogPurgeService and EventLogQueryService took ISiteEventLogger via DI and immediately downcast it: _eventLogger = (SiteEventLogger)eventLogger;. This made the registration fragile — any ISiteEventLogger that is not exactly SiteEventLogger (a test double, a decorator) caused an InvalidCastException at construction — and defeated the purpose of the interface abstraction.

Re-triage note (2026-05-16)

The finding as originally written also claimed the services "access the internal SqliteConnection Connection property to run arbitrary SQL" and called itself "the root cause of the unsynchronised connection sharing in SiteEventLogging-003". That part is stale: the resolution of SiteEventLogging-003 had already removed the internal Connection property and replaced it with lock-guarded WithConnection overloads. At the time this finding was actioned, the only remaining defect was the concrete-type downcast itself. Severity stays Medium; the description is corrected to the downcast-only scope.

Recommendation

Have the purge and query services depend on the concrete SiteEventLogger directly (it is the type that owns the lock-guarded WithConnection), and register the concrete type in DI with the interface forwarded to the same singleton. Remove the fragile downcasts.

Resolution

Resolved 2026-05-16 (commit pending): AddSiteEventLogging now registers SiteEventLogger as a concrete singleton and forwards ISiteEventLogger to that same instance. EventLogPurgeService and EventLogQueryService take a SiteEventLogger constructor parameter directly, eliminating the (SiteEventLogger)eventLogger downcast and its InvalidCastException risk. All three services still share one connection/lock. Regression tests AddSiteEventLogging_ResolvesAllServices_SharingOneRecorderInstance and PurgeAndQueryServices_AcceptConcreteRecorder_WithoutDowncast.

SiteEventLogging-008 — Event-recording write failures are silently swallowed


Severity	Medium
Category	Error handling & resilience
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:92-95`

Description

If ExecuteNonQuery throws (disk full, database locked, file corruption), the exception is caught, written to ILogger, and discarded; LogEventAsync still returns Task.CompletedTask as if successful. Callers fire-and-forget the result so they cannot detect failure. The event log is the site's diagnostic audit trail; a sustained write failure (for example a locked-database storm caused by the unsynchronised purge in SiteEventLogging-003) means events vanish with no signal to operators except a local log line that nobody is watching. There is no failure counter, no health-metric hook, and no retry.

Recommendation

Expose a failure signal: increment a counter that the Health Monitoring component can surface (the design notes script/alarm error rates are derived from the event log — a logging outage should be visible). At minimum, escalate repeated failures to a Warning/Error health metric rather than only a local log line.

Resolution

Resolved 2026-05-16 (commit pending): write failures are no longer swallowed. The background writer (introduced for SiteEventLogging-005) now, on an INSERT failure, (a) increments a new Interlocked-guarded counter exposed as the public SiteEventLogger.FailedWriteCount property — which Health Monitoring can poll to detect a logging outage — and (b) faults the Task returned by LogEventAsync with the exception instead of returning Task.CompletedTask. The error is still logged. Regression test LogEventAsync_FaultsTask_AndCountsFailure_OnWriteError.

SiteEventLogging-009 — XML doc on `LogEventAsync` claims asynchronous behaviour


Severity	Low
Category	Documentation & comments
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:8-10`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57`

Description

The interface XML doc states "Record an event asynchronously." and the method is named LogEventAsync, but the implementation is fully synchronous (see SiteEventLogging-005). The documentation and naming are misleading: a reader will reasonably assume the write is offloaded and the caller's thread is not blocked, which is false. The details parameter doc says "Optional JSON details" but nothing validates or requires JSON, so callers may pass arbitrary text.

Re-triage note (2026-05-16)

The finding's central premise is now stale. SiteEventLogging-005 was resolved by introducing a dedicated background writer: LogEventAsync enqueues onto an unbounded Channel<T> and returns without blocking, so the method is now genuinely asynchronous and the name LogEventAsync is accurate. No rename is warranted. The only residual documentation defect was the imprecise XML doc (the one-line summary did not describe the enqueue/await semantics) and the details doc claiming "JSON" when nothing enforces it. Scope corrected to those doc fixes only.

Recommendation

Align the name, signature, and documentation with the actual behaviour — either make the method genuinely asynchronous or rename to LogEvent and correct the doc. Clarify that details is free-form text unless JSON is actually enforced.

Resolution

Resolved 2026-05-16 (commit pending): the ISiteEventLogger.LogEventAsync XML doc now describes the actual behaviour — the call enqueues onto a background writer and returns without blocking, and the returned Task completes on durable persistence (or faults on write failure). The details parameter doc was corrected to "Optional free-form detail text ... JSON is conventional but not validated or enforced". No rename was needed: post-SiteEventLogging-005 the method is genuinely asynchronous, so LogEventAsync is accurate (see Re-triage note). Documentation-only change; no regression test added (behaviour is already covered by the SiteEventLogging-005 tests LogEventAsync_DoesNotBlockCaller_WhenWriteIsSlow and LogEventAsync_TaskCompletes_AfterEventIsPersisted).

SiteEventLogging-010 — Test coverage gaps: actor bridge, purge/write concurrency, vacuum effectiveness, query error path


Severity	Medium
Category	Testing coverage
Status	Resolved
Location	`tests/ScadaLink.SiteEventLogging.Tests/`

Description

The test suite covers recording, query filtering/pagination, and basic purge, but several critical behaviours are untested:

EventLogHandlerActor has no test — the actor message contract (EventLogQueryRequest -> EventLogQueryResponse, Sender.Tell) is unverified.
No test exercises purge running concurrently with active writes/queries, so the connection-sharing race (SiteEventLogging-003) is invisible to CI.
PurgeByStorageCap is only tested with MaxStorageMb = 0, which hides the no-op-vacuum / over-delete bug (SiteEventLogging-001, -002). No test asserts the file shrinks or that only oldest events are removed under a realistic cap.
EventLogQueryService.ExecuteQuery's catch block (Success: false, ErrorMessage) has no test.
SiteEventLogger.Dispose semantics (logging after dispose returns Task.CompletedTask) and re-entrant dispose are untested.

Recommendation

Add tests for the actor bridge, a concurrency stress test (purge + write + query in parallel), a realistic non-zero-cap purge test asserting size reduction and oldest-first deletion, and a query-error-path test (e.g. corrupt/closed connection).

Resolution

Resolved 2026-05-16 (commit pending): the remaining coverage gaps are now filled. EventLogHandlerActorTests (using Akka.TestKit.Xunit2) exercises the actor message contract — EventLogQueryRequest -> EventLogQueryResponse via Sender.Tell, for both success and error responses. EventLogCoverageTests covers the previously untested EventLogQueryService.ExecuteQuery catch block (ExecuteQuery_ReturnsFailureResponse_WhenDatabaseUnavailable) and the recorder's disposed-state semantics (LogEventAsync_AfterDispose_CompletesWithoutThrowing, Dispose_IsIdempotent). The purge/write concurrency, realistic-cap, and oldest-first behaviours were already covered by the tests added when SiteEventLogging-001/-002/-003 were resolved (PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection, PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable, PurgeByStorageCap_RemovesOldestEventsFirst).

SiteEventLogging-011 — Stale "Phase 4+" placeholder in `ServiceCollectionExtensions`


Severity	Low
Category	Documentation & comments
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:18-22`

Description

AddSiteEventLoggingActors is an empty method with a comment "Placeholder for Akka actor registration (Phase 4+)". The actor (EventLogHandlerActor) is in fact already implemented and is registered directly in AkkaHostedService.cs:313-336, not through this method. The placeholder is dead code: it is either never called or called with no effect, and the comment is stale. A reader looking for where the event-log actor is wired up will be misdirected.

Recommendation

Either implement the actor registration here and have AkkaHostedService call it (centralising the wiring), or delete AddSiteEventLoggingActors entirely and remove the misleading comment.

Resolution

Resolved 2026-05-16 (commit pending): confirmed dead code — a repo-wide search found zero callers of AddSiteEventLoggingActors, and EventLogHandlerActor is in fact wired up directly in ScadaLink.Host/Actors/AkkaHostedService.cs as a cluster singleton (it must be created inside the ActorSystem with a resolved IEventLogQueryService, which a IServiceCollection extension cannot do). The empty placeholder method and its stale "Phase 4+" comment were deleted, and a short explanatory note added to AddSiteEventLogging pointing readers to where the actor is actually registered. Documentation/dead-code change only; no regression test was added — the change is a method removal verified by the compiler (no callers) and the full module suite still passing.

SiteEventLogging-012 — Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted


Severity	Medium
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:160-166,193-197`

Description

LogEventAsync returns a Task that, per the interface XML doc (corrected under SiteEventLogging-009), "completes once the event is durably persisted and faults if the write fails, so callers that await it observe success or failure." Two paths break that contract by signalling success for an event that was never written:

In LogEventAsync, if _writeQueue.Writer.TryWrite(pending) fails (the channel has been completed because the logger was disposed), the code calls pending.Completion.TrySetResult() — completing the Task successfully — even though the comment immediately above acknowledges "there is nowhere to persist the event."
In ProcessWriteQueueAsync, WithConnection returns false when the logger has been disposed mid-drain. The code does not inspect the returned written flag and unconditionally calls pending.Completion.TrySetResult(), again reporting success for an event the comment admits "simply cannot be persisted."

The event log is the site's diagnostic audit trail. A caller that awaits LogEventAsync to confirm a critical event (deployment applied, alarm activated) was recorded will observe a successful completion for an event that was silently dropped. This is the same class of defect SiteEventLogging-008 fixed for write errors — but the disposed-drop path was left reporting false success. The window is the disposal/shutdown interval, during which shutdown-related events (graceful singleton handover, instance disable) are exactly the events most likely to be enqueued and lost.

Recommendation

For both paths, fault the Task (or complete it with a sentinel failure) instead of TrySetResult() — e.g. pending.Completion.TrySetException(new ObjectDisposedException(...)) — so an await-ing caller can distinguish a dropped event from a persisted one. Inspect the written flag returned by WithConnection in ProcessWriteQueueAsync and only call TrySetResult() when written is true. Update the XML doc if a deliberate "drop silently on shutdown" semantics is chosen instead.

Resolution

Resolved 2026-05-17 (commit <pending>): both disposed-drop paths now fault the returned Task with ObjectDisposedException instead of TrySetResult() — LogEventAsync on a failed TryWrite, and ProcessWriteQueueAsync now inspects the written flag from WithConnection and only calls TrySetResult() when the row was actually persisted. An await-ing caller can now distinguish a dropped audit event from a persisted one. Regression tests LogEventAsync_AfterDispose_FaultsTask_NotReportsSuccess and LogEventAsync_EnqueuedThenDisposed_FaultsTask_WhenWriteCannotComplete (the prior LogEventAsync_AfterDispose_CompletesWithoutThrowing test asserted the now-incorrect silent-success behaviour and was replaced).

SiteEventLogging-013 — Keyword search does not escape SQL `LIKE` wildcards in user input


Severity	Low
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:79-83`

Description

The keyword-search filter builds the LIKE pattern as $"%{request.KeywordFilter}%" and binds it as a parameter. Parameterisation correctly prevents SQL injection, but it does not neutralise the LIKE metacharacters % and _ inside the user-supplied keyword. A search for a literal _ (common in event sources and identifiers such as store_and_forward, PLC_1, or instance IDs) is interpreted as "match any single character", and a % matches any run of characters. The design calls keyword search "free-text search on message and source fields ... Useful for finding events by script name, alarm name, or error message" — users will reasonably expect a literal substring match, so a query for store_and_forward silently returns events containing storeXandYforward and similar false positives. There is no way for the caller to search for a literal underscore or percent.

Recommendation

Escape %, _, and the escape character itself in request.KeywordFilter before wrapping it in %...%, and append an ESCAPE clause to the LIKE expression (e.g. ... LIKE $keyword ESCAPE '\'). Alternatively document that the keyword field accepts LIKE wildcard syntax, but a literal-substring match is the behaviour the design implies.

Resolution

Resolved 2026-05-17 (commit <pending>): EventLogQueryService now escapes \, %, and _ in request.KeywordFilter via a new EscapeLikePattern helper before wrapping it in %...%, and the LIKE clauses carry an explicit ESCAPE '\' so the keyword is matched as a literal substring as the design intends. Regression tests Query_KeywordSearch_TreatsUnderscoreAsLiteral_NotWildcard, Query_KeywordSearch_TreatsPercentAsLiteral_NotWildcard, and Query_KeywordSearch_StillMatchesPlainSubstring.

SiteEventLogging-014 — Initial purge runs synchronously on the host startup thread


Severity	Low
Category	Performance & resource management
Status	Won't Fix
Location	`src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:34-48`

Description

EventLogPurgeService.ExecuteAsync calls RunPurge() (a fully synchronous method that runs PurgeByRetention and PurgeByStorageCap) before the first await (await timer.WaitForNextTickAsync(...)). A BackgroundService's ExecuteAsync is invoked from StartAsync, and the host's startup pipeline does not proceed past a BackgroundService until its ExecuteAsync yields at the first real await. Because RunPurge() precedes any await, the entire initial purge — including a cap-purge that deletes rows in 1000-row batches and runs PRAGMA incremental_vacuum until a near-1 GB database is back under the cap — executes inline on the startup thread, blocking host startup (and therefore the /health/ready gate) for as long as the purge takes. On a site that has accumulated a large log this can be a multi-second stall during every node start/failover. The class doc states the service "runs on a background thread and does not block event recording" — the startup-thread block is inconsistent with that intent.

Recommendation

Yield before the initial purge so it runs on the background scheduler rather than the startup thread — e.g. await Task.Yield(); as the first statement of ExecuteAsync, or move the initial RunPurge() to after the first await timer.WaitForNextTickAsync (accepting a one-interval delay), or offload it with await Task.Run(RunPurge, stoppingToken).

Re-triage note (2026-05-17)

The finding's central premise — "the host's startup pipeline does not proceed past a BackgroundService until its ExecuteAsync yields at the first real await", so the synchronous RunPurge() prelude blocks the startup thread — is incorrect for the .NET version this project targets. The module targets net10.0. Since .NET 8, BackgroundService.StartAsync no longer runs ExecuteAsync inline on the calling (host startup) thread: ExecuteAsync is dispatched onto the thread pool, and StartAsync returns immediately regardless of how long the synchronous prelude runs.

This was verified empirically. A standalone BackgroundService whose ExecuteAsync prelude does a 1.5 s synchronous Thread.Sleep before its first await showed StartAsync returning in 0 ms, with the prelude running on a different thread pool thread than the caller. Applied to EventLogPurgeService, StartAsync returns promptly and the initial RunPurge() executes on the background scheduler — host startup and the /health/ready gate are not blocked. There is therefore no defect to fix.

Resolution

Won't Fix — 2026-05-17 (commit <pending>). Re-triaged: the asserted startup-thread block cannot occur on .NET 8+ (this module targets net10.0), where BackgroundService dispatches ExecuteAsync to the thread pool rather than running its synchronous prelude on the host startup thread — verified empirically (see the re-triage note). No code change made. A verification test StartAsync_DoesNotBlock_OnTheInitialPurge was added to pin this behaviour (asserts StartAsync returns in under 1 s and the initial purge still runs on the background scheduler).

SiteEventLogging-015 — Background write queue is unbounded; can grow without limit under sustained writer slowness


Severity	Medium
Category	Performance & resource management
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:58-63`

Resolution (2026-05-28): Switched SiteEventLogger._writeQueue to Channel.CreateBounded<PendingEvent> with FullMode = BoundedChannelFullMode.DropOldest. Default capacity 10,000 via new SiteEventLogOptions.WriteQueueCapacity. The itemDropped callback faults the dropped event's Task with InvalidOperationException and increments FailedWriteCount so both awaiting callers and Health Monitoring observe drops. DropOldest preserves the SiteEventLogging-005 "callers never block" guarantee. Test: LogEventAsync_BoundedQueueDropsOldest_AndFaultsDroppedTask.

Description

SiteEventLogger creates its background-writer feeder as Channel.CreateUnbounded<PendingEvent>(...). The writer thread funnels every write through the shared _writeLock (acquired by WithConnection), so any condition that makes a single iteration slow — a long-running query in EventLogQueryService holding the lock, a PurgeByStorageCap run that takes the lock for batched DELETE + PRAGMA incremental_vacuum, a disk stall, or a sustained event burst from an alarm storm / script failure loop — drives the queue arbitrarily large. Every queued PendingEvent retains its TaskCompletionSource and its payload strings, so there is no upper bound on how much memory the recorder can hold.

The sister centralized-audit component ScadaLink.AuditLog/Site/SqliteAuditWriter.cs addresses the same hot-path-writer problem with Channel.CreateBounded<...>(new BoundedChannelOptions(_options.ChannelCapacity) { ..., FullMode = BoundedChannelFullMode.Wait }), giving back-pressure to producers. Site event logging picked the riskier choice for a component that — per the design — is fed by every site subsystem (script, alarm, deployment, DCL, store-and-forward, instance lifecycle, notification) and has both a 30-day retention sweep and a 1 GB cap-purge competing for the same lock.

Recommendation

Switch to Channel.CreateBounded<PendingEvent>(...) with a configurable capacity (default in the order of 10 000 — large enough to absorb a normal alarm burst, small enough to bound memory). Pick a FullMode that matches policy: Wait for back-pressure (callers await and serialise their actor thread on the queue — defeats some of the SiteEventLogging-005 win but is safe), or DropOldest / DropWrite with a counter (drop-and-count is closer to "best-effort audit"). Add the dropped-event counter to FailedWriteCount or a sibling metric. Document the chosen policy on ISiteEventLogger.LogEventAsync.

SiteEventLogging-016 — `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps


Severity	High
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:67-77`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:159`, `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:72-78`

Resolution — EventLogQueryService.ExecuteQuery now calls .ToUniversalTime() on request.From/request.To before ToString("o"), so the produced ISO 8601 string always ends in +00:00 and lexicographically matches the UTC timestamps written by SiteEventLogger. EventLogPurgeService.PurgeByRetention was also made defensive with an explicit .ToUniversalTime() on the cutoff. A regression test (Query_FiltersByTimeRange_HandlesNonUtcOffset) constructs a +05:00 DateTimeOffset and asserts the matching UTC-stored events are returned and out-of-range ones are excluded.

Description

Event rows are persisted with timestamp = DateTimeOffset.UtcNow.ToString("o"), which always emits the round-trip ISO 8601 form ending in the literal offset +00:00 (e.g. 2026-05-28T12:34:56.7890123+00:00). The query path filters by range using a direct string comparison:

whereClauses.Add("timestamp >= $from");
parameters.Add(new SqliteParameter("$from", request.From.Value.ToString("o")));

request.From is a DateTimeOffset? and ToString("o") preserves whatever offset the caller passed in. If a central client passes a non-UTC DateTimeOffset — for example the result of DateTimeOffset.Now in a UTC+05:00 timezone — the produced string is "2026-05-28T17:34:56.0000000+05:00", which is lexicographically greater than the equivalent UTC instant string "2026-05-28T12:34:56.0000000+00:00". The comparison timestamp >= $from is then evaluated as a byte-by-byte string compare (SQLite default BINARY collation), so the query either spuriously excludes events that genuinely occurred in the range, or spuriously includes events from a wholly different hour. The same defect applies to To. The retention purge does DateTimeOffset.UtcNow.AddDays(-N).ToString("o") (UTC) so it is safe; only the central query path is vulnerable.

The design explicitly states "All timestamps are UTC throughout the system" but the boundary between a central DateTimeOffset and the SQLite store is not enforced. A central UI rendered in a non-UTC timezone is the most likely trigger, and the defect silently corrupts every query that filters by time range — exactly the filter most likely to be set on a "show me what happened around the failover" query.

Recommendation

Normalise From / To to UTC before serialising: request.From.Value.ToUniversalTime().ToString("o") (or .UtcDateTime.ToString("o")), so the produced offset is always +00:00. Add a regression test that filters with a DateTimeOffset carrying a non-zero offset and asserts the matching events are returned. Optionally also store timestamps as Unix-epoch INTEGER and let SQLite compare numerically, eliminating the lexicographic-comparison hazard structurally.

SiteEventLogging-017 — Central client's `PageSize` is unbounded; defeats the "configurable page size" design rationale


Severity	Medium
Category	Security
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:55`, `src/ScadaLink.Commons/Messages/RemoteQuery/EventLogQueryRequest.cs:18`

Resolution (2026-05-28): EventLogQueryService.ExecuteQuery now clamps pageSize to new SiteEventLogOptions.MaxQueryPageSize (default 500, matching existing QueryPageSize default) before issuing the SQL. Silent clamp rather than reject so misconfigured clients still get a usable response. Test: Query_PageSize_IsClampedToMaxQueryPageSize (asserts a request of 100,000 is clamped down with HasMore = true).

Description

EventLogQueryService.ExecuteQuery resolves the effective page size as var pageSize = request.PageSize > 0 ? request.PageSize : _options.QueryPageSize; and uses it directly as the SQL LIMIT $limit (passing pageSize + 1 to detect "has more"). There is no upper bound. A central client — buggy or hostile — can send PageSize = int.MaxValue, in which case the query attempts to materialise the entire (up to 1 GB) event log into a single List<EventLogEntry> while holding the shared write lock. This:

Builds a worst-case ~1 GB managed allocation that, depending on Akka.NET cluster message serialisation limits, will then be serialised into an EventLogQueryResponse and pushed over the ClusterClient pipe.
Blocks all writes (purge, recorder hot path) for the duration of the scan because the read holds _writeLock.
Stalls the singleton EventLogHandlerActor, also blocking subsequent legitimate queries.

The design explicitly justifies pagination as preventing exactly this — "Results are paginated with a configurable page size (default: 500 events) ... This prevents broad queries from overwhelming the communication channel." The code honours the default but does not enforce an upper bound on a client-supplied override.

Recommendation

Clamp pageSize to a configurable maximum (e.g. SiteEventLogOptions.MaxQueryPageSize, default 5000) before using it. Also bound KeywordFilter.Length (e.g. 256 chars) — a leading-wildcard LIKE of an unbounded pattern is itself an expensive operation that runs under the same lock. Add a Success: false, ErrorMessage: "PageSize exceeds maximum" reject path so a misbehaving central is told why its query is refused.

SiteEventLogging-018 — `FailedWriteCount` is exposed but never consumed by Health Monitoring


Severity	Low
Category	Documentation & comments
Status	Open
Location	`src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:67-71,225-226`

Description

SiteEventLogger.FailedWriteCount was added under SiteEventLogging-008 with the XML doc statement "Surfaced so Health Monitoring can detect a logging outage instead of relying on a local log line nobody is watching." The implementation is correct (Interlocked.Increment on write failure, Interlocked.Read getter), but a repo-wide search shows no caller anywhere in src/ reads the property — neither ScadaLink.HealthMonitoring, the central health collector, nor the host's /health endpoint. The metric is dead-letter: a logging outage still goes unnoticed in production, contradicting the original finding's resolution claim.

The property is also exposed only on the concrete SiteEventLogger, not on ISiteEventLogger, so even if Health Monitoring were wired up it would have to take a concrete-type dependency (internal Connection removed, but FailedWriteCount remained concrete-only).

Recommendation

Either (a) wire FailedWriteCount into the existing Health Monitoring metric pipeline (e.g. publish it alongside other 30-second-interval site metrics, and promote a sustained non-zero value to a Warning), and add it to ISiteEventLogger so the consumer doesn't downcast; or (b) acknowledge the metric is unobserved by softening the XML doc to "Available for future Health Monitoring integration" and file a tracking item for the wiring. The current doc claim is misleading.

SiteEventLogging-019 — `EventLogPurgeService` runs on every host node; design says "active node"


Severity	Low
Category	Design-document adherence
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:57-95`, `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:30-39`, `docs/requirements/Component-SiteEventLogging.md:45`

Description

AddSiteEventLogging calls services.AddHostedService<EventLogPurgeService>(), which registers the purge BackgroundService per host. On a 2-node site cluster both node-a and node-b start the service independently, so each runs its own 30-day retention purge and 1 GB cap purge against its own local site_events.db. The design states only "A daily background job runs on the active node and deletes all events older than 30 days." (Component-SiteEventLogging, Storage section). In practice the standby node receives no writes, so its purge finds nothing to delete and is harmless — but the implementation does not match the documented "active node" gating, and the resolution note on SiteEventLogging-004 already flagged that the writer runs on the standby too. The purge has the same shape.

Aligning to the design is also a defence against a future change that does write to the standby (e.g. local heartbeats), and removes the per-node wake-ups that contribute to Microsoft.Extensions.Hosting shutdown latency.

Recommendation

Either (a) gate the purge service on "this node is the active member of siteRole" (check the cluster singleton ownership before each RunPurge(), or host the purge inside the same cluster singleton as EventLogHandlerActor), or (b) reword the design doc to "the purge runs on every node against its own local database; on the standby it is a no-op". Pick one; the current mismatch is a doc-vs-code defect.

Resolution (2026-05-28): Took option (a) at the loop level — registration stays unchanged on every host. Introduced a SiteEventLogActiveNodeCheck delegate that EventLogPurgeService consults at the top of every RunPurge() tick; standby returns early with a debug log. The DI factory resolves the delegate from the container so the Host can register the real check on a site node, and a null/unregistered delegate falls back to the prior "always run" behaviour (backward compatible for non-clustered hosts and existing tests). Defensive try/catch around the check defaults to "run" so a transient cluster-state read failure cannot stop the purge loop. Added tests RunPurge_OnStandbyNode_SkipsAllWork, RunPurge_OnActiveNode_RunsTheRetentionPurge, and RunPurge_WithNullCheck_FallsBackToRunning. Wiring the real check on the Host's site-role branch is left for the Host's review. Tests green (50/50 in SiteEventLogging.Tests).

SiteEventLogging-020 — `severity` and `eventType` are unvalidated free-form strings; doc enumerates a set that is not enforced


Severity	Low
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:144-156`, `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:14-15`

Resolution (2026-05-28): LogEventAsync now validates severity against the closed set {Info, Warning, Error} (case-sensitive, matches SQLite BINARY collation used by the query filter) and throws ArgumentException naming the offending value. eventType left intentionally free-form (design enumerates an open category set). Tests: LogEventAsync_ThrowsOnUnknownSeverity (5 cases) and LogEventAsync_AcceptsAllDocumentedSeverities (3 cases).

Description

LogEventAsync validates eventType and severity only for non-empty/non-whitespace. The XML doc enumerates the allowed values: eventType ∈ {script, alarm, deployment, connection, store_and_forward, instance_lifecycle}, severity ∈ {Info, Warning, Error}. Nothing in the code enforces either set. Any caller can pass "SCRIPT", "Script", "warn", "ERR", or a typo and the row is inserted verbatim. Two follow-on consequences:

The EventLogQueryService.Severity filter is severity = $severity (exact match, case-sensitive by SQLite default BINARY collation). A row stored as "error" will not be returned for a query filtering on "Error". The design lists severity as a first-class filter and the central UI will reasonably normalise to one casing — every row stored with a different casing is silently invisible to that filter.
The Events Logged table in the design implicitly relies on a stable event_type enumeration to drive UI grouping; a typo'd event_type slips in silently and is hard to detect later.

Recommendation

Validate eventType and severity against a known set (or accept enums on the interface, converting to canonical string at the call site). Reject unknown values with ArgumentException and log a single-shot warning during construction if a deployment is found to be using an unexpected value. Alternatively, normalise casing (severity = severity.ToLowerInvariant()) so the query filter is case-insensitive. Update the XML doc to match the enforced contract.

SiteEventLogging-021 — `DateTimeOffset.Parse` uses the current culture; can throw on non-default locales


Severity	Low
Category	Correctness & logic bugs
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:138`

Resolution (2026-05-28): EventLogQueryService.ExecuteQuery now parses the stored ISO 8601 timestamp with CultureInfo.InvariantCulture plus DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal, so the result is locale-independent and guaranteed UTC. No new test was added — the recorder still writes via DateTimeOffset.UtcNow.ToString("o") which is itself invariant-safe in practice, and the existing query/time-range tests continue to pass.

Description

ExecuteQuery materialises rows via DateTimeOffset.Parse(reader.GetString(1)). DateTimeOffset.Parse(string) uses CultureInfo.CurrentCulture and DateTimeStyles.None. The stored format is ISO 8601 round-trip ("o"), which is usually parseable in any culture — but a production node running with a non-default culture (e.g. Turkish "tr-TR", which has historically broken case-insensitive ASCII comparisons via the "Turkish-I" issue, or any culture that overrides the date/time separators) can parse incorrectly or throw FormatException. The exception is caught by the outer try, so the entire query is converted to a Success: false response — but the failure mode is silent and culture-dependent.

The recorder side stores via DateTimeOffset.UtcNow.ToString("o"), which is also culture-sensitive in the same way; on a hostile-culture node, the round-trip between insert and query is not guaranteed to be lossless without explicit culture pinning.

Recommendation

Parse with explicit invariant culture and round-trip style: DateTimeOffset.Parse(reader.GetString(1), CultureInfo.InvariantCulture, DateTimeStyles.RoundtripKind) (and the same for the ToString("o", InvariantCulture) emitters in SiteEventLogger.LogEventAsync and EventLogPurgeService.PurgeByRetention). Alternatively switch the schema to store timestamp as Unix-epoch INTEGER and avoid all string-parsing.

SiteEventLogging-022 — `Cache=Shared` is redundant for a single-connection logger


Severity	Low
Category	Code organization & conventions
Status	Resolved
Location	`src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:52`

Resolution (2026-05-28): Dropped Cache=Shared from the default connection string in SiteEventLogger's ctor — the logger owns exactly one SqliteConnection serialised under _writeLock, so the cross-connection shared-cache mode is dormant. The connectionStringOverride ctor parameter still lets a test path inject Cache=Shared explicitly when needed. No test relied on the default option (grep across tests/ScadaLink.SiteEventLogging.Tests finds zero references). Behaviour unchanged for production callers.

Description

The connection string is built as $"Data Source={options.Value.DatabasePath};Cache=Shared". SQLite's shared-cache mode is a cross-connection optimisation: it lets multiple SqliteConnections in the same process share an in-process page cache. This logger owns exactly one SqliteConnection and serialises all access through _writeLock, so Cache=Shared cannot share with anything — the mode is dormant. At best it is dead configuration; at worst it adds (very small) per-statement lock overhead inside SQLite. The sister SqliteAuditWriter carries the same unused option, so the smell is a copy-and-paste pattern.

Shared-cache mode also subtly changes the semantics of PRAGMA busy_timeout and PRAGMA locking_mode, so leaving it on while not using it is a small future-foot gun if anyone later opens a second connection to the same file from another component on the same host (e.g. a tooling read-only viewer).

Recommendation

Drop Cache=Shared from the connection string — the logger is single-connection and gains nothing from it. If a future need to share the DB across connections in the same process arises, reintroduce it deliberately together with the busy_timeout and locking_mode review that should accompany it.

SiteEventLogging-023 — Concurrent-stress test uses a non-volatile `stop` flag


Severity	Low
Category	Testing coverage
Status	Resolved
Location	`tests/ScadaLink.SiteEventLogging.Tests/EventLogPurgeServiceTests.cs:282-308`

Resolution (2026-05-28): Promoted the test-local bool stop to a volatile bool _stop field on EventLogPurgeServiceTests (with a doc-comment cross-referencing this finding). Every writer thread now observes the main thread's _stop = true flip without relying on JIT/runtime quirks across the await _eventLogger.LogEventAsync boundary, so the regression test for SiteEventLogging-003 can no longer hang past xUnit's per-test timeout in release builds. CancellationTokenSource was considered (canonical pattern) but volatile bool is the minimal-diff fix consistent with the existing structure.

Description

PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection uses a plain bool stop = false; that the main test thread mutates after the purge task completes (stop = true;) while four background writer tasks are spin-checking while (!stop). The flag is not declared volatile, not wrapped in Volatile.Read/Volatile.Write, and not behind a memory barrier. On a release build with a relaxed memory model the writer threads are permitted to cache the stop = false read indefinitely, which means in theory the test can hang past xUnit's per-test timeout instead of asserting Empty(exceptions). The test relies on observed JIT/runtime behaviour that today happens to refresh the field across the await _eventLogger.LogEventAsync boundary, but that is an implementation detail rather than a contract.

The test is a regression test for SiteEventLogging-003; a flaky / hang-prone version of it can mask the very behaviour it is meant to pin.

Recommendation

Use a CancellationTokenSource (while (!cts.IsCancellationRequested)), or change stop to a volatile bool, or use Interlocked.Exchange / Volatile.Read. CancellationTokenSource is the canonical .NET pattern and also lets the test cooperate with xUnit's Task.WhenAll timeout.

63 KiB Raw Blame History

Code Review — SiteEventLogging

Summary

Re-review 2026-05-17 (commit 39d737e)

Re-review 2026-05-28 (commit 1eb6e97)

Checklist coverage

Findings

SiteEventLogging-001 — PRAGMA incremental_vacuum is a no-op; storage cap cannot reclaim space

SiteEventLogging-002 — Storage-cap purge deletes the entire table when space is not reclaimed

SiteEventLogging-003 — Shared SqliteConnection used by purge and query without the write lock

SiteEventLogging-004 — Event-log handler runs as a cluster singleton that can land on the standby node

SiteEventLogging-005 — LogEventAsync performs synchronous disk I/O on the caller's thread

SiteEventLogging-006 — Missing indexes for severity and keyword-search query paths

SiteEventLogging-007 — ISiteEventLogger consumers downcast to the concrete type

SiteEventLogging-008 — Event-recording write failures are silently swallowed

SiteEventLogging-009 — XML doc on LogEventAsync claims asynchronous behaviour

SiteEventLogging-010 — Test coverage gaps: actor bridge, purge/write concurrency, vacuum effectiveness, query error path

SiteEventLogging-011 — Stale "Phase 4+" placeholder in ServiceCollectionExtensions

SiteEventLogging-012 — Dropped events report success: Task is completed, not faulted, when the event cannot be persisted

SiteEventLogging-013 — Keyword search does not escape SQL LIKE wildcards in user input

SiteEventLogging-014 — Initial purge runs synchronously on the host startup thread

SiteEventLogging-015 — Background write queue is unbounded; can grow without limit under sustained writer slowness

SiteEventLogging-016 — From/To filters compare non-normalised ISO 8601 strings against UTC-stored timestamps

SiteEventLogging-017 — Central client's PageSize is unbounded; defeats the "configurable page size" design rationale

SiteEventLogging-018 — FailedWriteCount is exposed but never consumed by Health Monitoring

SiteEventLogging-019 — EventLogPurgeService runs on every host node; design says "active node"

SiteEventLogging-020 — severity and eventType are unvalidated free-form strings; doc enumerates a set that is not enforced

SiteEventLogging-021 — DateTimeOffset.Parse uses the current culture; can throw on non-default locales

SiteEventLogging-022 — Cache=Shared is redundant for a single-connection logger

SiteEventLogging-023 — Concurrent-stress test uses a non-volatile stop flag

63 KiB

Raw Blame History

Re-review 2026-05-17 (commit `39d737e`)

Re-review 2026-05-28 (commit `1eb6e97`)

SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space

SiteEventLogging-003 — Shared `SqliteConnection` used by purge and query without the write lock

SiteEventLogging-005 — `LogEventAsync` performs synchronous disk I/O on the caller's thread

SiteEventLogging-007 — `ISiteEventLogger` consumers downcast to the concrete type

SiteEventLogging-009 — XML doc on `LogEventAsync` claims asynchronous behaviour

SiteEventLogging-011 — Stale "Phase 4+" placeholder in `ServiceCollectionExtensions`

SiteEventLogging-012 — Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted

SiteEventLogging-013 — Keyword search does not escape SQL `LIKE` wildcards in user input

SiteEventLogging-016 — `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps

SiteEventLogging-017 — Central client's `PageSize` is unbounded; defeats the "configurable page size" design rationale

SiteEventLogging-018 — `FailedWriteCount` is exposed but never consumed by Health Monitoring

SiteEventLogging-019 — `EventLogPurgeService` runs on every host node; design says "active node"

SiteEventLogging-020 — `severity` and `eventType` are unvalidated free-form strings; doc enumerates a set that is not enforced

SiteEventLogging-021 — `DateTimeOffset.Parse` uses the current culture; can throw on non-default locales

SiteEventLogging-022 — `Cache=Shared` is redundant for a single-connection logger

SiteEventLogging-023 — Concurrent-stress test uses a non-volatile `stop` flag