37 KiB
Code Review — SiteEventLogging
| Field | Value |
|---|---|
| Module | src/ScadaLink.SiteEventLogging |
| Design doc | docs/requirements/Component-SiteEventLogging.md |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | 39d737e |
| Open findings | 0 |
Summary
The SiteEventLogging module is small and broadly well-structured: a SQLite-backed
recorder (SiteEventLogger), a query service with keyset pagination, a background
purge service, and a thin Akka actor bridge. The query path is parameterised
correctly (no SQL injection) and reasonably well tested. However, the storage-cap
enforcement is functionally broken: PRAGMA incremental_vacuum is a no-op because
auto_vacuum = INCREMENTAL is never set, so the cap-purge loop never sees the
database shrink and over-deletes the entire table when triggered. There is also a
genuine concurrency hazard: the purge service and query service share the single
SqliteConnection owned by SiteEventLogger but bypass its _writeLock, so a purge
running on the background thread can collide with a write or a query on another
thread. The LogEventAsync API is synchronous despite its name and Task return,
which silently blocks Akka actor threads on disk I/O. Other findings concern the
cluster-singleton placement of the handler actor (which can pin to the standby
node), missing indexes for common query filters, retention/cap purge not enforcing
the requirement strictly, and several documentation/maintainability issues.
Re-review 2026-05-17 (commit 39d737e)
Re-reviewed the module at commit 39d737e. All eleven prior findings remain closed
(SiteEventLogging-001..003, 005..011 Resolved; 004 Won't Fix) and the resolutions
hold up under inspection — the background writer, lock-guarded WithConnection,
auto_vacuum = INCREMENTAL plus logical-size measurement, the severity index, and
the concrete-recorder DI wiring are all present and correct at this commit. The
module source is byte-identical between 39d737e and current HEAD, so this review
reflects the live code. Three new findings were recorded, all low-to-medium and none
regressions of prior fixes. The most notable (SiteEventLogging-012) is a correctness
gap left by the SiteEventLogging-005 background-writer rework: when an event cannot
be persisted because the logger has been disposed, the returned Task is completed
successfully rather than faulted, so an await-ing caller is told a dropped audit
event was written. The other two are minor: unescaped SQL LIKE wildcards in the
keyword-search filter (SiteEventLogging-013) and a claimed initial-purge block on the
host startup thread (SiteEventLogging-014 — later re-triaged to Won't Fix, the
premise does not hold on .NET 8+).
Checklist coverage
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | ☑ | incremental_vacuum no-op breaks cap purge (-001); over-delete on cap (-002). Re-review: dropped events report success (-012); LIKE wildcards unescaped in keyword search (-013). |
| 2 | Akka.NET conventions | ☑ | Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). Ask boundary is appropriate. |
| 3 | Concurrency & thread safety | ☑ | Shared SqliteConnection used by purge/query without the write lock (-003). |
| 4 | Error handling & resilience | ☑ | LogEventAsync swallows write failures silently into a log line only (-008); purge catches broadly. |
| 5 | Security | ☑ | Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding. |
| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). Re-review: claimed initial-purge startup-thread block (-014 — Won't Fix, premise invalid on .NET 8+). |
| 7 | Design-document adherence | ☑ | Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002). |
| 8 | Code organization & conventions | ☑ | Concrete-type downcast of ISiteEventLogger (-007); internal Connection leaks DB handle (-007). |
| 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
| 10 | Documentation & comments | ☑ | LogEventAsync XML doc says "asynchronously" but is synchronous (-009); stale "Phase 4+" placeholder (-011). |
Findings
SiteEventLogging-001 — PRAGMA incremental_vacuum is a no-op; storage cap cannot reclaim space
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:100-102, src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:36-55 |
Description
PurgeByStorageCap issues PRAGMA incremental_vacuum after each delete batch to
reclaim space, then re-measures the database size via page_count * page_size.
incremental_vacuum only has any effect when the database was created with
auto_vacuum = INCREMENTAL. InitializeSchema never sets auto_vacuum, so the
database uses the SQLite default (auto_vacuum = NONE). With NONE,
incremental_vacuum is silently ignored and page_count does not decrease when
rows are deleted (free pages are retained in the file). Consequently the
while (currentSizeBytes > capBytes) loop never observes the size dropping. The
storage-cap feature required by the design ("configurable maximum database size...
oldest events are purged first") is therefore non-functional — it cannot bring the
file back under the cap.
Recommendation
Set PRAGMA auto_vacuum = INCREMENTAL in InitializeSchema before any tables are
created (it must be set before table creation or followed by a full VACUUM to take
effect on an existing database). Alternatively, run a full VACUUM after cap-purge
deletes, or measure logical data size (e.g. page_count - freelist_count times
page_size) instead of relying on incremental_vacuum.
Resolution
Resolved 2026-05-16 (commit <pending>): InitializeSchema now sets
PRAGMA auto_vacuum = INCREMENTAL before any table is created, and
GetDatabaseSizeBytes measures logical size as (page_count - freelist_count) * page_size so reclaimed pages no longer mask the size drop. The cap-purge loop now
reliably observes the database shrinking. Regression test
PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable.
SiteEventLogging-002 — Storage-cap purge deletes the entire table when space is not reclaimed
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:87-105 |
Description
Because of SiteEventLogging-001 the on-disk size never shrinks after a delete batch,
so currentSizeBytes stays above capBytes. The loop then keeps deleting 1000-row
batches on every iteration until ExecuteNonQuery returns 0 — i.e. until the table
is completely empty. The design states the cap should purge "the oldest events...
first" to stay within budget, not wipe the whole log. When the cap is hit (e.g.
during an alarm storm) this destroys all retained diagnostic history rather than
trimming it to the budget. The unit test PurgeByStorageCap_DeletesOldestWhenOverCap
masks the problem because it uses MaxStorageMb = 0, which legitimately expects an
empty table, so the over-delete behaviour is never exercised against a realistic cap.
Recommendation
Fix the size measurement / vacuum (SiteEventLogging-001) so the loop terminates when
the file is genuinely under the cap. Add a guard so the loop stops once
currentSizeBytes has stopped decreasing across iterations, and add a test with a
non-zero cap and a known oversized dataset to assert that only the oldest events are
removed.
Resolution
Resolved 2026-05-16 (commit <pending>): with the size measurement fixed
(SiteEventLogging-001) the cap loop terminates when the file is genuinely under the
cap. An additional guard stops the loop if the on-disk size fails to decrease across
an iteration, so a cap that can never be met no longer empties the whole table.
Regression tests PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable (asserts the
table is not emptied and the file ends under a realistic non-zero cap) and
PurgeByStorageCap_RemovesOldestEventsFirst (asserts only the oldest events are
removed).
SiteEventLogging-003 — Shared SqliteConnection used by purge and query without the write lock
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:64,90,100,110,114, src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:36, src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34,72 |
Description
SiteEventLogger owns a single SqliteConnection and serialises its own writes via
lock (_writeLock). EventLogPurgeService and EventLogQueryService both reach
into _eventLogger.Connection and execute commands directly, without acquiring
_writeLock. The purge runs on a BackgroundService thread (a different thread from
event-recording callers and from the actor that drives the query service). A single
SqliteConnection / SqliteCommand is not thread-safe; concurrent use from the
purge thread and a recording thread (or query thread) can throw
SqliteException/InvalidOperationException ("DataReader already open",
"connection busy") or corrupt command state. The purge DELETE and the recorder
INSERT racing is the most likely collision because event recording is continuous.
Recommendation
Funnel all access to the connection through a single synchronisation point: either
expose lock-guarded methods on SiteEventLogger for purge/query to call, or give the
purge and query services their own dedicated SqliteConnection instances (SQLite
supports multiple connections to the same file; Cache=Shared plus a busy_timeout
makes this safer). Do not share one SqliteConnection across threads.
Resolution
Resolved 2026-05-16 (commit <pending>): the raw internal Connection property was
removed from SiteEventLogger and replaced with lock-guarded WithConnection(...)
overloads that hold the existing _writeLock for the duration of the caller's
delegate. EventLogPurgeService, EventLogQueryService, and LogEventAsync all now
access the connection exclusively through WithConnection, so the purge thread,
query thread, and recording threads are serialised on a single lock. Dispose was
also brought under the lock to avoid a dispose/use race. Regression test
PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection exercises purge running
concurrently with multiple writer threads.
SiteEventLogging-004 — Event-log handler runs as a cluster singleton that can land on the standby node
| Severity | Low |
| Original severity | High (re-triaged down to Low on 2026-05-16 — see Re-triage note) |
| Category | Design-document adherence |
| Status | Won't Fix |
| Location | src/ScadaLink.Host/Actors/AkkaHostedService.cs:313-336, src/ScadaLink.SiteEventLogging/EventLogHandlerActor.cs:21-25 |
Description
EventLogHandlerActor is hosted as a ClusterSingletonManager singleton with the
stated intent that "queries always reach the active node". However, an Akka.NET
cluster singleton is pinned to the oldest member of the role, which is not the
same concept as the SCADA "active node" (the node currently running the Deployment
Manager singleton / serving live traffic). The design doc is explicit: "Only the
active node generates and stores events... the new active node starts logging to its
own SQLite database." The event-log SQLite file is node-local and unreplicated.
Nothing guarantees the event-log singleton co-locates with the active node, so a
remote query can be served by the standby node and read that node's near-empty
database, returning no events even though the active node has a full log. The
explanatory comment in AkkaHostedService.cs asserts the opposite of what actually
happens.
Recommendation
Either (a) host the query handler as a normal per-node actor and route queries to the active node explicitly (the node owning the Deployment Manager singleton), or (b) make the event-log writer follow the same singleton so the writer and the query handler are guaranteed co-located. Reconcile the design doc and the inline comment with whichever model is chosen.
Re-triage note (2026-05-16)
The finding's central claim — that a remote query "can be served by the standby
node and read that node's near-empty database" — is incorrect for the query path.
In AkkaHostedService.cs the event-log-handler ClusterSingletonManager and the
deployment-manager ClusterSingletonManager are created with the same role
(siteRole) in the same cluster. Akka.NET pins every cluster singleton of a
given role to the oldest member of that role — so all same-role singletons in a
cluster co-locate on one node. The "active node" in this codebase is, by definition,
the node hosting the deployment-manager singleton; the event-log query singleton
is therefore guaranteed to run on that same node. A ClusterClient query cannot
land on the standby. The inline comment in AkkaHostedService.cs is accurate, not
"the opposite of what happens".
A real but distinct concern exists: the writer (SiteEventLogger) is registered
as a plain per-node DI singleton (AddSiteEventLogging), so it records to a local
SQLite file on every node, including the standby. That wastes storage on the
standby but does not cause the query-returns-nothing symptom the finding
describes, because the query singleton always reads the active node's (populated)
database. Gating the writer to the active node would be a ScadaLink.Host wiring
change, outside this module's scope, and is a minor optimisation rather than a
correctness defect.
Re-triaged from High to Low and closed as Won't Fix: the High-severity
correctness claim does not hold. Any residual cleanup (gate the standby-node writer;
the comment needs no change) can be raised as a fresh Low finding against
ScadaLink.Host if desired.
Resolution
Won't Fix — 2026-05-16 (commit <pending>). Re-triaged: the asserted defect (query
served by standby returning an empty log) cannot occur because the event-log query
singleton and the deployment-manager singleton share a role and so always co-locate
on the active node. No code change made; see the re-triage note above.
SiteEventLogging-005 — LogEventAsync performs synchronous disk I/O on the caller's thread
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57-99 |
Description
LogEventAsync is declared async-shaped (returns Task, Async suffix) but its
body is entirely synchronous: it takes lock (_writeLock), runs
cmd.ExecuteNonQuery() (a blocking SQLite write), then returns Task.CompletedTask.
Callers across the codebase invoke it fire-and-forget as _ = LogEventAsync(...)
(e.g. ScriptExecutionActor.cs:133, DataConnectionActor.cs:292,
ScriptActor.cs:250) expecting it to be non-blocking. In reality the SQLite write,
and any contention on _writeLock, executes inline on the Akka actor thread of the
calling subsystem. Under an event burst (alarm storm, script failure loop) this
serialises actor threads on disk I/O and the global write lock, degrading the
hot-path subsystems the design intends to keep responsive.
Recommendation
Either make recording genuinely asynchronous (offload to a dedicated single-threaded
writer / Channel<T> consumer so callers truly fire-and-forget), or rename the
method to LogEvent and document that it blocks, so callers can decide. Given the
design's emphasis on not impacting runtime subsystems, an internal queue with a
background flush is preferable.
Resolution
Resolved 2026-05-16 (commit pending): event recording is now offloaded to a
dedicated background writer. SiteEventLogger owns an unbounded Channel<T> and a
single background consumer thread; LogEventAsync only validates its arguments and
enqueues, so caller threads (Akka actor threads on hot paths) never block on the
SQLite write or on contention for the write lock. The returned Task completes once
the event is durably persisted (so await callers still observe write ordering) and
faults if the write fails. Dispose completes the channel and drains the writer.
Regression test LogEventAsync_DoesNotBlockCaller_WhenWriteIsSlow (verifies the
caller returns in <500 ms while the database is held busy) plus
LogEventAsync_TaskCompletes_AfterEventIsPersisted.
SiteEventLogging-006 — Missing indexes for severity and keyword-search query paths
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:50-52, src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:65-81 |
Description
InitializeSchema creates indexes on timestamp, event_type, and instance_id.
The query service also filters on severity (severity = $severity) and performs
message LIKE '%...%' / source LIKE '%...%' keyword search. severity has no
index, and a leading-wildcard LIKE cannot use a normal index at all. With up to a
1 GB database and a 500-row page size, severity-filtered and keyword queries do full
table scans on every page. The design explicitly lists keyword search as a supported,
expected query type.
Recommendation
Add an index on severity (or a composite index aligned with common filter
combinations such as (event_type, severity, id)). For keyword search, consider an
FTS5 virtual table over message and source, or accept the scan but document the
cost.
Resolution
Resolved 2026-05-16 (commit pending): InitializeSchema now creates
idx_events_severity on site_events(severity), so severity-filtered queries use an
index instead of full-scanning. The leading-wildcard LIKE keyword search cannot use
a B-tree index by construction; rather than adding an FTS5 table, that scan is
accepted and now documented with an inline comment next to the index DDL. Regression
tests Schema_HasIndexOnSeverity and SeverityFilteredQuery_UsesIndex_NotFullScan
(the latter asserts the SQLite query plan uses idx_events_severity and does not
SCAN site_events).
SiteEventLogging-007 — ISiteEventLogger consumers downcast to the concrete type
| Severity | Medium (partially re-triaged 2026-05-16 — see Re-triage note) |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:21-30, src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:20-28, src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:10-23 |
Description
Both EventLogPurgeService and EventLogQueryService took ISiteEventLogger via
DI and immediately downcast it: _eventLogger = (SiteEventLogger)eventLogger;. This
made the registration fragile — any ISiteEventLogger that is not exactly
SiteEventLogger (a test double, a decorator) caused an InvalidCastException at
construction — and defeated the purpose of the interface abstraction.
Re-triage note (2026-05-16)
The finding as originally written also claimed the services "access the internal SqliteConnection Connection property to run arbitrary SQL" and called itself "the
root cause of the unsynchronised connection sharing in SiteEventLogging-003". That
part is stale: the resolution of SiteEventLogging-003 had already removed the
internal Connection property and replaced it with lock-guarded WithConnection
overloads. At the time this finding was actioned, the only remaining defect was the
concrete-type downcast itself. Severity stays Medium; the description is corrected to
the downcast-only scope.
Recommendation
Have the purge and query services depend on the concrete SiteEventLogger directly
(it is the type that owns the lock-guarded WithConnection), and register the
concrete type in DI with the interface forwarded to the same singleton. Remove the
fragile downcasts.
Resolution
Resolved 2026-05-16 (commit pending): AddSiteEventLogging now registers
SiteEventLogger as a concrete singleton and forwards ISiteEventLogger to that
same instance. EventLogPurgeService and EventLogQueryService take a
SiteEventLogger constructor parameter directly, eliminating the
(SiteEventLogger)eventLogger downcast and its InvalidCastException risk. All
three services still share one connection/lock. Regression tests
AddSiteEventLogging_ResolvesAllServices_SharingOneRecorderInstance and
PurgeAndQueryServices_AcceptConcreteRecorder_WithoutDowncast.
SiteEventLogging-008 — Event-recording write failures are silently swallowed
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:92-95 |
Description
If ExecuteNonQuery throws (disk full, database locked, file corruption), the
exception is caught, written to ILogger, and discarded; LogEventAsync still
returns Task.CompletedTask as if successful. Callers fire-and-forget the result so
they cannot detect failure. The event log is the site's diagnostic audit trail; a
sustained write failure (for example a locked-database storm caused by the
unsynchronised purge in SiteEventLogging-003) means events vanish with no signal to
operators except a local log line that nobody is watching. There is no failure
counter, no health-metric hook, and no retry.
Recommendation
Expose a failure signal: increment a counter that the Health Monitoring component can surface (the design notes script/alarm error rates are derived from the event log — a logging outage should be visible). At minimum, escalate repeated failures to a Warning/Error health metric rather than only a local log line.
Resolution
Resolved 2026-05-16 (commit pending): write failures are no longer swallowed. The
background writer (introduced for SiteEventLogging-005) now, on an INSERT failure,
(a) increments a new Interlocked-guarded counter exposed as the public
SiteEventLogger.FailedWriteCount property — which Health Monitoring can poll to
detect a logging outage — and (b) faults the Task returned by LogEventAsync with
the exception instead of returning Task.CompletedTask. The error is still logged.
Regression test LogEventAsync_FaultsTask_AndCountsFailure_OnWriteError.
SiteEventLogging-009 — XML doc on LogEventAsync claims asynchronous behaviour
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:8-10, src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57 |
Description
The interface XML doc states "Record an event asynchronously." and the method is
named LogEventAsync, but the implementation is fully synchronous (see
SiteEventLogging-005). The documentation and naming are misleading: a reader will
reasonably assume the write is offloaded and the caller's thread is not blocked,
which is false. The details parameter doc says "Optional JSON details" but nothing
validates or requires JSON, so callers may pass arbitrary text.
Re-triage note (2026-05-16)
The finding's central premise is now stale. SiteEventLogging-005 was resolved by
introducing a dedicated background writer: LogEventAsync enqueues onto an unbounded
Channel<T> and returns without blocking, so the method is now genuinely
asynchronous and the name LogEventAsync is accurate. No rename is warranted. The
only residual documentation defect was the imprecise XML doc (the one-line summary
did not describe the enqueue/await semantics) and the details doc claiming "JSON"
when nothing enforces it. Scope corrected to those doc fixes only.
Recommendation
Align the name, signature, and documentation with the actual behaviour — either make
the method genuinely asynchronous or rename to LogEvent and correct the doc.
Clarify that details is free-form text unless JSON is actually enforced.
Resolution
Resolved 2026-05-16 (commit pending): the ISiteEventLogger.LogEventAsync XML doc
now describes the actual behaviour — the call enqueues onto a background writer and
returns without blocking, and the returned Task completes on durable persistence
(or faults on write failure). The details parameter doc was corrected to "Optional
free-form detail text ... JSON is conventional but not validated or enforced". No
rename was needed: post-SiteEventLogging-005 the method is genuinely asynchronous, so
LogEventAsync is accurate (see Re-triage note). Documentation-only change; no
regression test added (behaviour is already covered by the SiteEventLogging-005
tests LogEventAsync_DoesNotBlockCaller_WhenWriteIsSlow and
LogEventAsync_TaskCompletes_AfterEventIsPersisted).
SiteEventLogging-010 — Test coverage gaps: actor bridge, purge/write concurrency, vacuum effectiveness, query error path
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | tests/ScadaLink.SiteEventLogging.Tests/ |
Description
The test suite covers recording, query filtering/pagination, and basic purge, but several critical behaviours are untested:
EventLogHandlerActorhas no test — the actor message contract (EventLogQueryRequest->EventLogQueryResponse,Sender.Tell) is unverified.- No test exercises purge running concurrently with active writes/queries, so the connection-sharing race (SiteEventLogging-003) is invisible to CI.
PurgeByStorageCapis only tested withMaxStorageMb = 0, which hides the no-op-vacuum / over-delete bug (SiteEventLogging-001, -002). No test asserts the file shrinks or that only oldest events are removed under a realistic cap.EventLogQueryService.ExecuteQuery's catch block (Success: false,ErrorMessage) has no test.SiteEventLogger.Disposesemantics (logging after dispose returnsTask.CompletedTask) and re-entrant dispose are untested.
Recommendation
Add tests for the actor bridge, a concurrency stress test (purge + write + query in parallel), a realistic non-zero-cap purge test asserting size reduction and oldest-first deletion, and a query-error-path test (e.g. corrupt/closed connection).
Resolution
Resolved 2026-05-16 (commit pending): the remaining coverage gaps are now filled.
EventLogHandlerActorTests (using Akka.TestKit.Xunit2) exercises the actor message
contract — EventLogQueryRequest -> EventLogQueryResponse via Sender.Tell, for
both success and error responses. EventLogCoverageTests covers the previously
untested EventLogQueryService.ExecuteQuery catch block
(ExecuteQuery_ReturnsFailureResponse_WhenDatabaseUnavailable) and the recorder's
disposed-state semantics (LogEventAsync_AfterDispose_CompletesWithoutThrowing,
Dispose_IsIdempotent). The purge/write concurrency, realistic-cap, and
oldest-first behaviours were already covered by the tests added when
SiteEventLogging-001/-002/-003 were resolved
(PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection,
PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable,
PurgeByStorageCap_RemovesOldestEventsFirst).
SiteEventLogging-011 — Stale "Phase 4+" placeholder in ServiceCollectionExtensions
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:18-22 |
Description
AddSiteEventLoggingActors is an empty method with a comment "Placeholder for Akka
actor registration (Phase 4+)". The actor (EventLogHandlerActor) is in fact already
implemented and is registered directly in AkkaHostedService.cs:313-336, not through
this method. The placeholder is dead code: it is either never called or called with
no effect, and the comment is stale. A reader looking for where the event-log actor
is wired up will be misdirected.
Recommendation
Either implement the actor registration here and have AkkaHostedService call it
(centralising the wiring), or delete AddSiteEventLoggingActors entirely and remove
the misleading comment.
Resolution
Resolved 2026-05-16 (commit pending): confirmed dead code — a repo-wide search
found zero callers of AddSiteEventLoggingActors, and EventLogHandlerActor is in
fact wired up directly in ScadaLink.Host/Actors/AkkaHostedService.cs as a cluster
singleton (it must be created inside the ActorSystem with a resolved
IEventLogQueryService, which a IServiceCollection extension cannot do). The empty
placeholder method and its stale "Phase 4+" comment were deleted, and a short
explanatory note added to AddSiteEventLogging pointing readers to where the actor
is actually registered. Documentation/dead-code change only; no regression test was
added — the change is a method removal verified by the compiler (no callers) and the
full module suite still passing.
SiteEventLogging-012 — Dropped events report success: Task is completed, not faulted, when the event cannot be persisted
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:160-166,193-197 |
Description
LogEventAsync returns a Task that, per the interface XML doc (corrected under
SiteEventLogging-009), "completes once the event is durably persisted and faults if
the write fails, so callers that await it observe success or failure." Two paths
break that contract by signalling success for an event that was never written:
- In
LogEventAsync, if_writeQueue.Writer.TryWrite(pending)fails (the channel has been completed because the logger was disposed), the code callspending.Completion.TrySetResult()— completing theTasksuccessfully — even though the comment immediately above acknowledges "there is nowhere to persist the event." - In
ProcessWriteQueueAsync,WithConnectionreturnsfalsewhen the logger has been disposed mid-drain. The code does not inspect the returnedwrittenflag and unconditionally callspending.Completion.TrySetResult(), again reporting success for an event the comment admits "simply cannot be persisted."
The event log is the site's diagnostic audit trail. A caller that awaits
LogEventAsync to confirm a critical event (deployment applied, alarm activated) was
recorded will observe a successful completion for an event that was silently
dropped. This is the same class of defect SiteEventLogging-008 fixed for write
errors — but the disposed-drop path was left reporting false success. The window
is the disposal/shutdown interval, during which shutdown-related events (graceful
singleton handover, instance disable) are exactly the events most likely to be
enqueued and lost.
Recommendation
For both paths, fault the Task (or complete it with a sentinel failure) instead of
TrySetResult() — e.g. pending.Completion.TrySetException(new ObjectDisposedException(...))
— so an await-ing caller can distinguish a dropped event from a persisted one.
Inspect the written flag returned by WithConnection in ProcessWriteQueueAsync
and only call TrySetResult() when written is true. Update the XML doc if a
deliberate "drop silently on shutdown" semantics is chosen instead.
Resolution
Resolved 2026-05-17 (commit <pending>): both disposed-drop paths now fault the
returned Task with ObjectDisposedException instead of TrySetResult() —
LogEventAsync on a failed TryWrite, and ProcessWriteQueueAsync now inspects the
written flag from WithConnection and only calls TrySetResult() when the row was
actually persisted. An await-ing caller can now distinguish a dropped audit event
from a persisted one. Regression tests
LogEventAsync_AfterDispose_FaultsTask_NotReportsSuccess and
LogEventAsync_EnqueuedThenDisposed_FaultsTask_WhenWriteCannotComplete (the prior
LogEventAsync_AfterDispose_CompletesWithoutThrowing test asserted the now-incorrect
silent-success behaviour and was replaced).
SiteEventLogging-013 — Keyword search does not escape SQL LIKE wildcards in user input
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:79-83 |
Description
The keyword-search filter builds the LIKE pattern as $"%{request.KeywordFilter}%"
and binds it as a parameter. Parameterisation correctly prevents SQL injection, but
it does not neutralise the LIKE metacharacters % and _ inside the
user-supplied keyword. A search for a literal _ (common in event sources and
identifiers such as store_and_forward, PLC_1, or instance IDs) is interpreted as
"match any single character", and a % matches any run of characters. The design
calls keyword search "free-text search on message and source fields ... Useful for
finding events by script name, alarm name, or error message" — users will reasonably
expect a literal substring match, so a query for store_and_forward silently returns
events containing storeXandYforward and similar false positives. There is no way
for the caller to search for a literal underscore or percent.
Recommendation
Escape %, _, and the escape character itself in request.KeywordFilter before
wrapping it in %...%, and append an ESCAPE clause to the LIKE expression
(e.g. ... LIKE $keyword ESCAPE '\'). Alternatively document that the keyword field
accepts LIKE wildcard syntax, but a literal-substring match is the behaviour the
design implies.
Resolution
Resolved 2026-05-17 (commit <pending>): EventLogQueryService now escapes \,
%, and _ in request.KeywordFilter via a new EscapeLikePattern helper before
wrapping it in %...%, and the LIKE clauses carry an explicit ESCAPE '\' so the
keyword is matched as a literal substring as the design intends. Regression tests
Query_KeywordSearch_TreatsUnderscoreAsLiteral_NotWildcard,
Query_KeywordSearch_TreatsPercentAsLiteral_NotWildcard, and
Query_KeywordSearch_StillMatchesPlainSubstring.
SiteEventLogging-014 — Initial purge runs synchronously on the host startup thread
| Severity | Low |
| Category | Performance & resource management |
| Status | Won't Fix |
| Location | src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:34-48 |
Description
EventLogPurgeService.ExecuteAsync calls RunPurge() (a fully synchronous method
that runs PurgeByRetention and PurgeByStorageCap) before the first await
(await timer.WaitForNextTickAsync(...)). A BackgroundService's ExecuteAsync is
invoked from StartAsync, and the host's startup pipeline does not proceed past a
BackgroundService until its ExecuteAsync yields at the first real await. Because
RunPurge() precedes any await, the entire initial purge — including a cap-purge
that deletes rows in 1000-row batches and runs PRAGMA incremental_vacuum until a
near-1 GB database is back under the cap — executes inline on the startup thread,
blocking host startup (and therefore the /health/ready gate) for as long as the
purge takes. On a site that has accumulated a large log this can be a multi-second
stall during every node start/failover. The class doc states the service "runs on a
background thread and does not block event recording" — the startup-thread block is
inconsistent with that intent.
Recommendation
Yield before the initial purge so it runs on the background scheduler rather than the
startup thread — e.g. await Task.Yield(); as the first statement of ExecuteAsync,
or move the initial RunPurge() to after the first await timer.WaitForNextTickAsync
(accepting a one-interval delay), or offload it with await Task.Run(RunPurge, stoppingToken).
Re-triage note (2026-05-17)
The finding's central premise — "the host's startup pipeline does not proceed past a
BackgroundService until its ExecuteAsync yields at the first real await", so the
synchronous RunPurge() prelude blocks the startup thread — is incorrect for the
.NET version this project targets. The module targets net10.0. Since .NET 8,
BackgroundService.StartAsync no longer runs ExecuteAsync inline on the calling
(host startup) thread: ExecuteAsync is dispatched onto the thread pool, and
StartAsync returns immediately regardless of how long the synchronous prelude runs.
This was verified empirically. A standalone BackgroundService whose ExecuteAsync
prelude does a 1.5 s synchronous Thread.Sleep before its first await showed
StartAsync returning in 0 ms, with the prelude running on a different thread
pool thread than the caller. Applied to EventLogPurgeService, StartAsync returns
promptly and the initial RunPurge() executes on the background scheduler — host
startup and the /health/ready gate are not blocked. There is therefore no defect
to fix.
Resolution
Won't Fix — 2026-05-17 (commit <pending>). Re-triaged: the asserted startup-thread
block cannot occur on .NET 8+ (this module targets net10.0), where
BackgroundService dispatches ExecuteAsync to the thread pool rather than running
its synchronous prelude on the host startup thread — verified empirically (see the
re-triage note). No code change made. A verification test
StartAsync_DoesNotBlock_OnTheInitialPurge was added to pin this behaviour
(asserts StartAsync returns in under 1 s and the initial purge still runs on the
background scheduler).