docs: add code review process and baseline review of all 19 modules

Establishes a per-module code review workflow under code-reviews/ and records the 2026-05-16 baseline review (commit 9c60592): 241 findings across all src/ modules (6 Critical, 46 High, 100 Medium, 89 Low). This is the clean starting point for remediation work.
2026-05-16 18:09:09 -04:00
parent 9c60592632
commit 977d7369a7
23 changed files with 8899 additions and 0 deletions
--- a/code-reviews/SiteEventLogging/findings.md
+++ b/code-reviews/SiteEventLogging/findings.md
@@ -0,0 +1,402 @@
+# Code Review — SiteEventLogging
+
+| Field | Value |
+|-------|-------|
+| Module | `src/ScadaLink.SiteEventLogging` |
+| Design doc | `docs/requirements/Component-SiteEventLogging.md` |
+| Status | Reviewed |
+| Last reviewed | 2026-05-16 |
+| Reviewer | claude-agent |
+| Commit reviewed | `9c60592` |
+| Open findings | 11 |
+
+## Summary
+
+The SiteEventLogging module is small and broadly well-structured: a SQLite-backed
+recorder (`SiteEventLogger`), a query service with keyset pagination, a background
+purge service, and a thin Akka actor bridge. The query path is parameterised
+correctly (no SQL injection) and reasonably well tested. However, the storage-cap
+enforcement is functionally broken: `PRAGMA incremental_vacuum` is a no-op because
+`auto_vacuum = INCREMENTAL` is never set, so the cap-purge loop never sees the
+database shrink and over-deletes the entire table when triggered. There is also a
+genuine concurrency hazard: the purge service and query service share the single
+`SqliteConnection` owned by `SiteEventLogger` but bypass its `_writeLock`, so a purge
+running on the background thread can collide with a write or a query on another
+thread. The `LogEventAsync` API is synchronous despite its name and `Task` return
+type, so every call blocks the calling Akka actor thread on disk I/O. Other findings
+concern the cluster-singleton placement of the handler actor (which can pin it to the
+standby node), missing indexes for common query filters, a cap purge that does not
+honour the design's oldest-first-within-budget requirement, and several
+documentation/maintainability issues.
+
+## Checklist coverage
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). |
+| 2 | Akka.NET conventions | ☑ | Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). `Ask` boundary is appropriate. |
+| 3 | Concurrency & thread safety | ☑ | Shared `SqliteConnection` used by purge/query without the write lock (-003). |
+| 4 | Error handling & resilience | ☑ | `LogEventAsync` catches write failures and only writes a log line (-008); purge catches exceptions broadly. |
+| 5 | Security | ☑ | Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding. |
+| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). |
+| 7 | Design-document adherence | ☑ | Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002). |
+| 8 | Code organization & conventions | ☑ | Concrete-type downcast of `ISiteEventLogger` (-007); `internal Connection` leaks DB handle (-007). |
+| 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
+| 10 | Documentation & comments | ☑ | `LogEventAsync` XML doc says "asynchronously" but is synchronous (-009); stale "Phase 4+" placeholder (-011). |
+
+## Findings
+
+### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:100-102`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:36-55` |
+
+**Description**
+
+`PurgeByStorageCap` issues `PRAGMA incremental_vacuum` after each delete batch to
+reclaim space, then re-measures the database size via `page_count * page_size`.
+`incremental_vacuum` has an effect only when the database is in
+`auto_vacuum = INCREMENTAL` mode. `InitializeSchema` never sets `auto_vacuum`, so the
+database uses the SQLite default (`auto_vacuum = NONE`). With `NONE`,
+`incremental_vacuum` is silently ignored and `page_count` does not decrease when
+rows are deleted (free pages are retained in the file). Consequently the
+`while (currentSizeBytes > capBytes)` loop never observes the size dropping. The
+storage-cap feature required by the design ("configurable maximum database size...
+oldest events are purged first") is therefore non-functional — it cannot bring the
+file back under the cap.
+
+**Recommendation**
+
+Set `PRAGMA auto_vacuum = INCREMENTAL` in `InitializeSchema` before any tables are
+created. On an existing database the setting only takes effect after a full `VACUUM`.
+Alternatively, run a full `VACUUM` after cap-purge deletes, or measure logical data
+size as `(page_count - freelist_count) * page_size` instead of relying on
+`incremental_vacuum`.
+
+**Resolution**
+
+_Unresolved._
+
+### SiteEventLogging-002 — Storage-cap purge deletes the entire table when space is not reclaimed
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:87-105` |
+
+**Description**
+
+Because of SiteEventLogging-001, the on-disk size never shrinks after a delete batch,
+so `currentSizeBytes` stays above `capBytes`. The loop therefore deletes another
+1000-row batch on every iteration until `ExecuteNonQuery` returns 0, that is, until
+the table is empty. The design states the cap should purge "the oldest events...
+first" to stay within budget, not wipe the whole log. When the cap is hit (e.g.
+during an alarm storm) this destroys all retained diagnostic history rather than
+trimming it to the budget. The unit test `PurgeByStorageCap_DeletesOldestWhenOverCap`
+masks the problem because it uses `MaxStorageMb = 0`, which legitimately expects an
+empty table, so the over-delete behaviour is never exercised against a realistic cap.
+
+**Recommendation**
+
+Fix the size measurement / vacuum (SiteEventLogging-001) so the loop terminates when
+the file is genuinely under the cap. Add a guard so the loop stops once
+`currentSizeBytes` has stopped decreasing across iterations, and add a test with a
+non-zero cap and a known oversized dataset to assert that only the oldest events are
+removed.
+
+**Resolution**
+
+_Unresolved._
+
+### SiteEventLogging-003 — Shared `SqliteConnection` used by purge and query without the write lock
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:64,90,100,110,114`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:36`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34,72` |
+
+**Description**
+
+`SiteEventLogger` owns a single `SqliteConnection` and serialises its own writes via
+`lock (_writeLock)`. `EventLogPurgeService` and `EventLogQueryService` both reach
+into `_eventLogger.Connection` and execute commands directly, without acquiring
+`_writeLock`. The purge runs on a `BackgroundService` thread (a different thread from
+event-recording callers and from the actor that drives the query service). A single
+`SqliteConnection` / `SqliteCommand` is not thread-safe; concurrent use from the
+purge thread and a recording thread (or query thread) can throw
+`SqliteException`/`InvalidOperationException` ("DataReader already open",
+"connection busy") or corrupt command state. The purge `DELETE` and the recorder
+`INSERT` racing is the most likely collision because event recording is continuous.
+
+**Recommendation**
+
+Funnel all access to the connection through a single synchronisation point: either
+expose lock-guarded methods on `SiteEventLogger` for purge/query to call, or give the
+purge and query services their own dedicated `SqliteConnection` instances. SQLite
+supports multiple connections to the same file; WAL journal mode plus a
+`busy_timeout` reduces reader/writer contention, whereas `Cache=Shared` is
+discouraged by the SQLite documentation. Do not share one `SqliteConnection` across
+threads.
+
+**Resolution**
+
+_Unresolved._
+
+### SiteEventLogging-004 — Event-log handler runs as a cluster singleton that can land on the standby node
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:313-336`, `src/ScadaLink.SiteEventLogging/EventLogHandlerActor.cs:21-25` |
+
+**Description**
+
+`EventLogHandlerActor` is hosted as a `ClusterSingletonManager` singleton with the
+stated intent that "queries always reach the active node". However, an Akka.NET
+cluster singleton is pinned to the *oldest* member of the role, which is not the
+same concept as the SCADA "active node" (the node currently running the Deployment
+Manager singleton / serving live traffic). The design doc is explicit: "Only the
+active node generates and stores events... the new active node starts logging to its
+own SQLite database." The event-log SQLite file is node-local and unreplicated.
+Nothing guarantees the event-log singleton co-locates with the active node, so a
+remote query can be served by the standby node and read that node's near-empty
+database, returning no events even though the active node has a full log. The
+explanatory comment in `AkkaHostedService.cs` asserts the opposite of what actually
+happens.
+
+**Recommendation**
+
+Either (a) host the query handler as a normal per-node actor and route queries to
+the active node explicitly (the node owning the Deployment Manager singleton), or
+(b) make the event-log writer follow the same singleton so the writer and the query
+handler are guaranteed co-located. Reconcile the design doc and the inline comment
+with whichever model is chosen.
+
+**Resolution**
+
+_Unresolved._
+
+### SiteEventLogging-005 — `LogEventAsync` performs synchronous disk I/O on the caller's thread
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57-99` |
+
+**Description**
+
+`LogEventAsync` looks asynchronous (it returns `Task` and carries the `Async`
+suffix), but its body is entirely synchronous: it takes `lock (_writeLock)`, runs
+`cmd.ExecuteNonQuery()` (a blocking SQLite write), then returns `Task.CompletedTask`.
+Callers across the codebase invoke it fire-and-forget as `_ = LogEventAsync(...)`
+(e.g. `ScriptExecutionActor.cs:133`, `DataConnectionActor.cs:292`,
+`ScriptActor.cs:250`) expecting it to be non-blocking. In reality the SQLite write,
+and any contention on `_writeLock`, executes inline on the Akka actor thread of the
+calling subsystem. Under an event burst (alarm storm, script failure loop) this
+serialises actor threads on disk I/O and the global write lock, degrading the
+hot-path subsystems the design intends to keep responsive.
+
+**Recommendation**
+
+Either make recording genuinely asynchronous (offload to a dedicated single-threaded
+writer / `Channel<T>` consumer so callers truly fire-and-forget), or rename the
+method to `LogEvent` and document that it blocks, so callers can decide. Given the
+design's emphasis on not impacting runtime subsystems, an internal queue with a
+background flush is preferable.
+
+**Resolution**
+
+_Unresolved._
+
+### SiteEventLogging-006 — Missing indexes for severity and keyword-search query paths
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:50-52`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:65-81` |
+
+**Description**
+
+`InitializeSchema` creates indexes on `timestamp`, `event_type`, and `instance_id`.
+The query service also filters on `severity` (`severity = $severity`) and performs
+`message LIKE '%...%'` / `source LIKE '%...%'` keyword search. `severity` has no
+index, and a leading-wildcard `LIKE` cannot use a normal index at all. With up to a
+1 GB database and a 500-row page size, severity-filtered and keyword queries do full
+table scans on every page. The design explicitly lists keyword search as a supported,
+expected query type.
+
+**Recommendation**
+
+Add an index on `severity` (or a composite index aligned with common filter
+combinations such as `(event_type, severity, id)`). For keyword search, consider an
+FTS5 virtual table over `message` and `source`, or accept the scan but document the
+cost.
+
+**Resolution**
+
+_Unresolved._
+
+### SiteEventLogging-007 — `ISiteEventLogger` consumers downcast to the concrete type and reach into the DB connection
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:25`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:26`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34` |
+
+**Description**
+
+Both `EventLogPurgeService` and `EventLogQueryService` take `ISiteEventLogger` via
+DI and immediately downcast it: `_eventLogger = (SiteEventLogger)eventLogger;`. They
+then access the `internal SqliteConnection Connection` property to run arbitrary SQL.
+This defeats the purpose of the interface abstraction, makes the registration
+fragile (any `ISiteEventLogger` that is not exactly `SiteEventLogger` causes an
+`InvalidCastException` at construction), and leaks the database handle and raw SQL
+surface out of the recorder. It is also the root cause of the unsynchronised
+connection sharing in SiteEventLogging-003.
+
+**Recommendation**
+
+Introduce a proper data-access abstraction (e.g. an `IEventLogStore` with
+`Insert`, `Query`, `PurgeOlderThan`, `PurgeToSize`, `GetSizeBytes`) that owns the
+connection and its locking, and inject that into the recorder, query, and purge
+services. Remove the `internal Connection` property and the concrete-type downcasts.
+
+**Resolution**
+
+_Unresolved._
+
+### SiteEventLogging-008 — Event-recording write failures are silently swallowed
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:92-95` |
+
+**Description**
+
+If `ExecuteNonQuery` throws (disk full, database locked, file corruption), the
+exception is caught, written to `ILogger`, and discarded; `LogEventAsync` still
+returns `Task.CompletedTask` as if successful. Callers fire-and-forget the result so
+they cannot detect failure. The event log is the site's diagnostic audit trail; a
+sustained write failure (for example a locked-database storm caused by the
+unsynchronised purge in SiteEventLogging-003) means events vanish with no signal to
+operators except a local log line that nobody is watching. There is no failure
+counter, no health-metric hook, and no retry.
+
+**Recommendation**
+
+Expose a failure signal: increment a counter that the Health Monitoring component
+can surface (the design notes script/alarm error rates are derived from the event
+log — a logging outage should be visible). At minimum, escalate repeated failures to
+a Warning/Error health metric rather than only a local log line.
+
+**Resolution**
+
+_Unresolved._
+
+### SiteEventLogging-009 — XML doc on `LogEventAsync` claims asynchronous behaviour
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:8-10`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57` |
+
+**Description**
+
+The interface XML doc states "Record an event asynchronously." and the method is
+named `LogEventAsync`, but the implementation is fully synchronous (see
+SiteEventLogging-005). The documentation and naming are misleading: a reader will
+reasonably assume the write is offloaded and the caller's thread is not blocked,
+which is false. The `details` parameter doc says "Optional JSON details" but nothing
+validates or requires JSON, so callers may pass arbitrary text.
+
+**Recommendation**
+
+Align the name, signature, and documentation with the actual behaviour — either make
+the method genuinely asynchronous or rename to `LogEvent` and correct the doc.
+Clarify that `details` is free-form text unless JSON is actually enforced.
+
+**Resolution**
+
+_Unresolved._
+
+### SiteEventLogging-010 — Test coverage gaps: actor bridge, purge/write concurrency, vacuum effectiveness, query error path
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.SiteEventLogging.Tests/` |
+
+**Description**
+
+The test suite covers recording, query filtering/pagination, and basic purge, but
+several critical behaviours are untested:
+
+- `EventLogHandlerActor` has no test — the actor message contract
+  (`EventLogQueryRequest` -> `EventLogQueryResponse`, `Sender.Tell`) is unverified.
+- No test exercises purge running concurrently with active writes/queries, so the
+  connection-sharing race (SiteEventLogging-003) is invisible to CI.
+- `PurgeByStorageCap` is only tested with `MaxStorageMb = 0`, which hides the
+  no-op-vacuum / over-delete bug (SiteEventLogging-001, -002). No test asserts the
+  file shrinks or that only oldest events are removed under a realistic cap.
+- `EventLogQueryService.ExecuteQuery`'s catch block (`Success: false`,
+  `ErrorMessage`) has no test.
+- `SiteEventLogger.Dispose` semantics (logging after dispose returns
+  `Task.CompletedTask`) and re-entrant dispose are untested.
+
+**Recommendation**
+
+Add tests for the actor bridge, a concurrency stress test (purge + write + query in
+parallel), a realistic non-zero-cap purge test asserting size reduction and
+oldest-first deletion, a query-error-path test (e.g. corrupt/closed connection), and
+tests for logging after dispose and repeated `Dispose` calls.
+
+**Resolution**
+
+_Unresolved._
+
+### SiteEventLogging-011 — Stale "Phase 4+" placeholder in `ServiceCollectionExtensions`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:18-22` |
+
+**Description**
+
+`AddSiteEventLoggingActors` is an empty method with a comment "Placeholder for Akka
+actor registration (Phase 4+)". The actor (`EventLogHandlerActor`) is in fact already
+implemented and is registered directly in `AkkaHostedService.cs:313-336`, not through
+this method. The placeholder is dead code: it is either never called or called with
+no effect, and the comment is stale. A reader looking for where the event-log actor
+is wired up will be misdirected.
+
+**Recommendation**
+
+Either implement the actor registration here and have `AkkaHostedService` call it
+(centralising the wiring), or delete `AddSiteEventLoggingActors` entirely and remove
+the misleading comment.
+
+**Resolution**
+
+_Unresolved._
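
Note on SiteEventLogging-001 (outside the diff above): the vacuum behaviour described in that finding can be reproduced independently of the module. What follows is a minimal sketch in Python using the standard-library `sqlite3` module, not the module's C# code. The table name `site_events`, the row count and size, and the helper `page_counts_after_purge` are hypothetical and chosen only for illustration. It fills a scratch database, deletes most rows, runs `PRAGMA incremental_vacuum`, and reports `page_count` before and after together with `freelist_count`.

```python
import os
import sqlite3
import tempfile


def page_counts_after_purge(incremental):
    """Fill a scratch database, delete most rows, run incremental_vacuum.

    Returns (page_count before delete, page_count after, freelist_count after).
    """
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)  # SQLite treats an empty file as a brand-new database
    try:
        conn = sqlite3.connect(path, isolation_level=None)  # autocommit mode
        if incremental:
            # Has to run before the first table is created to take effect.
            conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
        conn.execute(
            "CREATE TABLE site_events (id INTEGER PRIMARY KEY, message TEXT)"
        )
        # Insert inside one explicit transaction to avoid 2000 separate commits.
        conn.execute("BEGIN")
        conn.executemany(
            "INSERT INTO site_events (message) VALUES (?)",
            (("x" * 500,) for _ in range(2000)),
        )
        conn.execute("COMMIT")
        before = conn.execute("PRAGMA page_count").fetchone()[0]

        # Hypothetical stand-in for the purge: delete the oldest 1500 rows.
        conn.execute("DELETE FROM site_events WHERE id <= 1500")
        # fetchall() steps the pragma to completion rather than one step.
        conn.execute("PRAGMA incremental_vacuum").fetchall()

        after = conn.execute("PRAGMA page_count").fetchone()[0]
        free = conn.execute("PRAGMA freelist_count").fetchone()[0]
        conn.close()
        return before, after, free
    finally:
        os.remove(path)


if __name__ == "__main__":
    for mode in (False, True):
        label = "auto_vacuum=INCREMENTAL" if mode else "auto_vacuum=NONE (default)"
        print(label, page_counts_after_purge(mode))
```

Under these assumptions, the default-mode run should show a `page_count` that does not shrink and a non-zero `freelist_count`, while the `INCREMENTAL` run should show a smaller `page_count`. In either mode, `(page_count - freelist_count) * page_size` approximates the logical data size, which is the alternative measurement the recommendation mentions.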