Files
scadalink-design/code-reviews/SiteEventLogging/findings.md

22 KiB

Code Review — SiteEventLogging

Field Value
Module src/ScadaLink.SiteEventLogging
Design doc docs/requirements/Component-SiteEventLogging.md
Status Reviewed
Last reviewed 2026-05-16
Reviewer claude-agent
Commit reviewed 9c60592
Open findings 7

Summary

The SiteEventLogging module is small and broadly well-structured: a SQLite-backed recorder (SiteEventLogger), a query service with keyset pagination, a background purge service, and a thin Akka actor bridge. The query path is parameterised correctly (no SQL injection) and reasonably well tested. However, the storage-cap enforcement is functionally broken: PRAGMA incremental_vacuum is a no-op because auto_vacuum = INCREMENTAL is never set, so the cap-purge loop never sees the database shrink and over-deletes the entire table when triggered. There is also a genuine concurrency hazard: the purge service and query service share the single SqliteConnection owned by SiteEventLogger but bypass its _writeLock, so a purge running on the background thread can collide with a write or a query on another thread. The LogEventAsync API is synchronous despite its name and Task return, which silently blocks Akka actor threads on disk I/O. Other findings concern the cluster-singleton placement of the handler actor (which can pin to the standby node), missing indexes for common query filters, retention/cap purge not enforcing the requirement strictly, and several documentation/maintainability issues.

Checklist coverage

# Category Examined Notes
1 Correctness & logic bugs incremental_vacuum no-op breaks cap purge (-001); over-delete on cap (-002).
2 Akka.NET conventions Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). Ask boundary is appropriate.
3 Concurrency & thread safety Shared SqliteConnection used by purge/query without the write lock (-003).
4 Error handling & resilience LogEventAsync swallows write failures silently into a log line only (-008); purge catches broadly.
5 Security Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding.
6 Performance & resource management Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006).
7 Design-document adherence Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002).
8 Code organization & conventions Concrete-type downcast of ISiteEventLogger (-007); internal Connection leaks DB handle (-007).
9 Testing coverage No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010).
10 Documentation & comments LogEventAsync XML doc says "asynchronously" but is synchronous (-009); stale "Phase 4+" placeholder (-011).

Findings

SiteEventLogging-001 — PRAGMA incremental_vacuum is a no-op; storage cap cannot reclaim space

Severity High
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:100-102, src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:36-55

Description

PurgeByStorageCap issues PRAGMA incremental_vacuum after each delete batch to reclaim space, then re-measures the database size via page_count * page_size. incremental_vacuum only has any effect when the database was created with auto_vacuum = INCREMENTAL. InitializeSchema never sets auto_vacuum, so the database uses the SQLite default (auto_vacuum = NONE). With NONE, incremental_vacuum is silently ignored and page_count does not decrease when rows are deleted (free pages are retained in the file). Consequently the while (currentSizeBytes > capBytes) loop never observes the size dropping. The storage-cap feature required by the design ("configurable maximum database size... oldest events are purged first") is therefore non-functional — it cannot bring the file back under the cap.

Recommendation

Set PRAGMA auto_vacuum = INCREMENTAL in InitializeSchema before any tables are created (it must be set before table creation or followed by a full VACUUM to take effect on an existing database). Alternatively, run a full VACUUM after cap-purge deletes, or measure logical data size (e.g. page_count - freelist_count times page_size) instead of relying on incremental_vacuum.

Resolution

Resolved 2026-05-16 (commit <pending>): InitializeSchema now sets PRAGMA auto_vacuum = INCREMENTAL before any table is created, and GetDatabaseSizeBytes measures logical size as (page_count - freelist_count) * page_size so reclaimed pages no longer mask the size drop. The cap-purge loop now reliably observes the database shrinking. Regression test PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable.

SiteEventLogging-002 — Storage-cap purge deletes the entire table when space is not reclaimed

Severity High
Category Design-document adherence
Status Resolved
Location src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:87-105

Description

Because of SiteEventLogging-001 the on-disk size never shrinks after a delete batch, so currentSizeBytes stays above capBytes. The loop then keeps deleting 1000-row batches on every iteration until ExecuteNonQuery returns 0 — i.e. until the table is completely empty. The design states the cap should purge "the oldest events... first" to stay within budget, not wipe the whole log. When the cap is hit (e.g. during an alarm storm) this destroys all retained diagnostic history rather than trimming it to the budget. The unit test PurgeByStorageCap_DeletesOldestWhenOverCap masks the problem because it uses MaxStorageMb = 0, which legitimately expects an empty table, so the over-delete behaviour is never exercised against a realistic cap.

Recommendation

Fix the size measurement / vacuum (SiteEventLogging-001) so the loop terminates when the file is genuinely under the cap. Add a guard so the loop stops once currentSizeBytes has stopped decreasing across iterations, and add a test with a non-zero cap and a known oversized dataset to assert that only the oldest events are removed.

Resolution

Resolved 2026-05-16 (commit <pending>): with the size measurement fixed (SiteEventLogging-001) the cap loop terminates when the file is genuinely under the cap. An additional guard stops the loop if the on-disk size fails to decrease across an iteration, so a cap that can never be met no longer empties the whole table. Regression tests PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable (asserts the table is not emptied and the file ends under a realistic non-zero cap) and PurgeByStorageCap_RemovesOldestEventsFirst (asserts only the oldest events are removed).

SiteEventLogging-003 — Shared SqliteConnection used by purge and query without the write lock

Severity High
Category Concurrency & thread safety
Status Resolved
Location src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:64,90,100,110,114, src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:36, src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34,72

Description

SiteEventLogger owns a single SqliteConnection and serialises its own writes via lock (_writeLock). EventLogPurgeService and EventLogQueryService both reach into _eventLogger.Connection and execute commands directly, without acquiring _writeLock. The purge runs on a BackgroundService thread (a different thread from event-recording callers and from the actor that drives the query service). A single SqliteConnection / SqliteCommand is not thread-safe; concurrent use from the purge thread and a recording thread (or query thread) can throw SqliteException/InvalidOperationException ("DataReader already open", "connection busy") or corrupt command state. The purge DELETE and the recorder INSERT racing is the most likely collision because event recording is continuous.

Recommendation

Funnel all access to the connection through a single synchronisation point: either expose lock-guarded methods on SiteEventLogger for purge/query to call, or give the purge and query services their own dedicated SqliteConnection instances (SQLite supports multiple connections to the same file; Cache=Shared plus a busy_timeout makes this safer). Do not share one SqliteConnection across threads.

Resolution

Resolved 2026-05-16 (commit <pending>): the raw internal Connection property was removed from SiteEventLogger and replaced with lock-guarded WithConnection(...) overloads that hold the existing _writeLock for the duration of the caller's delegate. EventLogPurgeService, EventLogQueryService, and LogEventAsync all now access the connection exclusively through WithConnection, so the purge thread, query thread, and recording threads are serialised on a single lock. Dispose was also brought under the lock to avoid a dispose/use race. Regression test PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection exercises purge running concurrently with multiple writer threads.

SiteEventLogging-004 — Event-log handler runs as a cluster singleton that can land on the standby node

Severity Low
Original severity High (re-triaged down to Low on 2026-05-16 — see Re-triage note)
Category Design-document adherence
Status Won't Fix
Location src/ScadaLink.Host/Actors/AkkaHostedService.cs:313-336, src/ScadaLink.SiteEventLogging/EventLogHandlerActor.cs:21-25

Description

EventLogHandlerActor is hosted as a ClusterSingletonManager singleton with the stated intent that "queries always reach the active node". However, an Akka.NET cluster singleton is pinned to the oldest member of the role, which is not the same concept as the SCADA "active node" (the node currently running the Deployment Manager singleton / serving live traffic). The design doc is explicit: "Only the active node generates and stores events... the new active node starts logging to its own SQLite database." The event-log SQLite file is node-local and unreplicated. Nothing guarantees the event-log singleton co-locates with the active node, so a remote query can be served by the standby node and read that node's near-empty database, returning no events even though the active node has a full log. The explanatory comment in AkkaHostedService.cs asserts the opposite of what actually happens.

Recommendation

Either (a) host the query handler as a normal per-node actor and route queries to the active node explicitly (the node owning the Deployment Manager singleton), or (b) make the event-log writer follow the same singleton so the writer and the query handler are guaranteed co-located. Reconcile the design doc and the inline comment with whichever model is chosen.

Re-triage note (2026-05-16)

The finding's central claim — that a remote query "can be served by the standby node and read that node's near-empty database" — is incorrect for the query path. In AkkaHostedService.cs the event-log-handler ClusterSingletonManager and the deployment-manager ClusterSingletonManager are created with the same role (siteRole) in the same cluster. Akka.NET pins every cluster singleton of a given role to the oldest member of that role — so all same-role singletons in a cluster co-locate on one node. The "active node" in this codebase is, by definition, the node hosting the deployment-manager singleton; the event-log query singleton is therefore guaranteed to run on that same node. A ClusterClient query cannot land on the standby. The inline comment in AkkaHostedService.cs is accurate, not "the opposite of what happens".

A real but distinct concern exists: the writer (SiteEventLogger) is registered as a plain per-node DI singleton (AddSiteEventLogging), so it records to a local SQLite file on every node, including the standby. That wastes storage on the standby but does not cause the query-returns-nothing symptom the finding describes, because the query singleton always reads the active node's (populated) database. Gating the writer to the active node would be a ScadaLink.Host wiring change, outside this module's scope, and is a minor optimisation rather than a correctness defect.

Re-triaged from High to Low and closed as Won't Fix: the High-severity correctness claim does not hold. Any residual cleanup (gate the standby-node writer; the comment needs no change) can be raised as a fresh Low finding against ScadaLink.Host if desired.

Resolution

Won't Fix — 2026-05-16 (commit <pending>). Re-triaged: the asserted defect (query served by standby returning an empty log) cannot occur because the event-log query singleton and the deployment-manager singleton share a role and so always co-locate on the active node. No code change made; see the re-triage note above.

SiteEventLogging-005 — LogEventAsync performs synchronous disk I/O on the caller's thread

Severity Medium
Category Performance & resource management
Status Open
Location src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57-99

Description

LogEventAsync is declared async-shaped (returns Task, Async suffix) but its body is entirely synchronous: it takes lock (_writeLock), runs cmd.ExecuteNonQuery() (a blocking SQLite write), then returns Task.CompletedTask. Callers across the codebase invoke it fire-and-forget as _ = LogEventAsync(...) (e.g. ScriptExecutionActor.cs:133, DataConnectionActor.cs:292, ScriptActor.cs:250) expecting it to be non-blocking. In reality the SQLite write, and any contention on _writeLock, executes inline on the Akka actor thread of the calling subsystem. Under an event burst (alarm storm, script failure loop) this serialises actor threads on disk I/O and the global write lock, degrading the hot-path subsystems the design intends to keep responsive.

Recommendation

Either make recording genuinely asynchronous (offload to a dedicated single-threaded writer / Channel<T> consumer so callers truly fire-and-forget), or rename the method to LogEvent and document that it blocks, so callers can decide. Given the design's emphasis on not impacting runtime subsystems, an internal queue with a background flush is preferable.

Resolution

Unresolved.

SiteEventLogging-006 — Missing indexes for severity and keyword-search query paths

Severity Low
Category Performance & resource management
Status Open
Location src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:50-52, src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:65-81

Description

InitializeSchema creates indexes on timestamp, event_type, and instance_id. The query service also filters on severity (severity = $severity) and performs message LIKE '%...%' / source LIKE '%...%' keyword search. severity has no index, and a leading-wildcard LIKE cannot use a normal index at all. With up to a 1 GB database and a 500-row page size, severity-filtered and keyword queries do full table scans on every page. The design explicitly lists keyword search as a supported, expected query type.

Recommendation

Add an index on severity (or a composite index aligned with common filter combinations such as (event_type, severity, id)). For keyword search, consider an FTS5 virtual table over message and source, or accept the scan but document the cost.

Resolution

Unresolved.

SiteEventLogging-007 — ISiteEventLogger consumers downcast to the concrete type and reach into the DB connection

Severity Medium
Category Code organization & conventions
Status Open
Location src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:25, src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:26, src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34

Description

Both EventLogPurgeService and EventLogQueryService take ISiteEventLogger via DI and immediately downcast it: _eventLogger = (SiteEventLogger)eventLogger;. They then access the internal SqliteConnection Connection property to run arbitrary SQL. This defeats the purpose of the interface abstraction, makes the registration fragile (any ISiteEventLogger that is not exactly SiteEventLogger causes an InvalidCastException at construction), and leaks the database handle and raw SQL surface out of the recorder. It is also the root cause of the unsynchronised connection sharing in SiteEventLogging-003.

Recommendation

Introduce a proper data-access abstraction (e.g. an IEventLogStore with Insert, Query, PurgeOlderThan, PurgeToSize, GetSizeBytes) that owns the connection and its locking, and inject that into the recorder, query, and purge services. Remove the internal Connection property and the concrete-type downcasts.

Resolution

Unresolved.

SiteEventLogging-008 — Event-recording write failures are silently swallowed

Severity Medium
Category Error handling & resilience
Status Open
Location src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:92-95

Description

If ExecuteNonQuery throws (disk full, database locked, file corruption), the exception is caught, written to ILogger, and discarded; LogEventAsync still returns Task.CompletedTask as if successful. Callers fire-and-forget the result so they cannot detect failure. The event log is the site's diagnostic audit trail; a sustained write failure (for example a locked-database storm caused by the unsynchronised purge in SiteEventLogging-003) means events vanish with no signal to operators except a local log line that nobody is watching. There is no failure counter, no health-metric hook, and no retry.

Recommendation

Expose a failure signal: increment a counter that the Health Monitoring component can surface (the design notes script/alarm error rates are derived from the event log — a logging outage should be visible). At minimum, escalate repeated failures to a Warning/Error health metric rather than only a local log line.

Resolution

Unresolved.

SiteEventLogging-009 — XML doc on LogEventAsync claims asynchronous behaviour

Severity Low
Category Documentation & comments
Status Open
Location src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:8-10, src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57

Description

The interface XML doc states "Record an event asynchronously." and the method is named LogEventAsync, but the implementation is fully synchronous (see SiteEventLogging-005). The documentation and naming are misleading: a reader will reasonably assume the write is offloaded and the caller's thread is not blocked, which is false. The details parameter doc says "Optional JSON details" but nothing validates or requires JSON, so callers may pass arbitrary text.

Recommendation

Align the name, signature, and documentation with the actual behaviour — either make the method genuinely asynchronous or rename to LogEvent and correct the doc. Clarify that details is free-form text unless JSON is actually enforced.

Resolution

Unresolved.

SiteEventLogging-010 — Test coverage gaps: actor bridge, purge/write concurrency, vacuum effectiveness, query error path

Severity Medium
Category Testing coverage
Status Open
Location tests/ScadaLink.SiteEventLogging.Tests/

Description

The test suite covers recording, query filtering/pagination, and basic purge, but several critical behaviours are untested:

  • EventLogHandlerActor has no test — the actor message contract (EventLogQueryRequest -> EventLogQueryResponse, Sender.Tell) is unverified.
  • No test exercises purge running concurrently with active writes/queries, so the connection-sharing race (SiteEventLogging-003) is invisible to CI.
  • PurgeByStorageCap is only tested with MaxStorageMb = 0, which hides the no-op-vacuum / over-delete bug (SiteEventLogging-001, -002). No test asserts the file shrinks or that only oldest events are removed under a realistic cap.
  • EventLogQueryService.ExecuteQuery's catch block (Success: false, ErrorMessage) has no test.
  • SiteEventLogger.Dispose semantics (logging after dispose returns Task.CompletedTask) and re-entrant dispose are untested.

Recommendation

Add tests for the actor bridge, a concurrency stress test (purge + write + query in parallel), a realistic non-zero-cap purge test asserting size reduction and oldest-first deletion, and a query-error-path test (e.g. corrupt/closed connection).

Resolution

Unresolved.

SiteEventLogging-011 — Stale "Phase 4+" placeholder in ServiceCollectionExtensions

Severity Low
Category Documentation & comments
Status Open
Location src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:18-22

Description

AddSiteEventLoggingActors is an empty method with a comment "Placeholder for Akka actor registration (Phase 4+)". The actor (EventLogHandlerActor) is in fact already implemented and is registered directly in AkkaHostedService.cs:313-336, not through this method. The placeholder is dead code: it is either never called or called with no effect, and the comment is stale. A reader looking for where the event-log actor is wired up will be misdirected.

Recommendation

Either implement the actor registration here and have AkkaHostedService call it (centralising the wiring), or delete AddSiteEventLoggingActors entirely and remove the misleading comment.

Resolution

Unresolved.