docs: add code review process and baseline review of all 19 modules

Establishes a per-module code review workflow under code-reviews/ and
records the 2026-05-16 baseline review (commit 9c60592): 241 findings
across all src/ modules (6 Critical, 46 High, 100 Medium, 89 Low).
This is the clean starting point for remediation work.
Joseph Doherty
2026-05-16 18:09:09 -04:00
parent 9c60592632
commit 977d7369a7
23 changed files with 8899 additions and 0 deletions

@@ -0,0 +1,402 @@
# Code Review — SiteEventLogging
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.SiteEventLogging` |
| Design doc | `docs/requirements/Component-SiteEventLogging.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 11 |
## Summary
The SiteEventLogging module is small and broadly well-structured: a SQLite-backed
recorder (`SiteEventLogger`), a query service with keyset pagination, a background
purge service, and a thin Akka actor bridge. The query path is parameterised
correctly (no SQL injection) and reasonably well tested. However, the storage-cap
enforcement is functionally broken: `PRAGMA incremental_vacuum` is a no-op because
`auto_vacuum = INCREMENTAL` is never set, so the cap-purge loop never sees the
database shrink and, once triggered, deletes every row in the table. There is also a
genuine concurrency hazard: the purge service and query service share the single
`SqliteConnection` owned by `SiteEventLogger` but bypass its `_writeLock`, so a purge
running on the background thread can collide with a write or a query on another
thread. The `LogEventAsync` API is synchronous despite its name and `Task` return,
which silently blocks Akka actor threads on disk I/O. Other findings concern the
cluster-singleton placement of the handler actor (which can pin to the standby
node), missing indexes for common query filters, retention/cap purge not enforcing
the requirement strictly, and several documentation/maintainability issues.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). |
| 2 | Akka.NET conventions | ☑ | Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). `Ask` boundary is appropriate. |
| 3 | Concurrency & thread safety | ☑ | Shared `SqliteConnection` used by purge/query without the write lock (-003). |
| 4 | Error handling & resilience | ☑ | `LogEventAsync` swallows write failures, leaving only a log line (-008); purge catches broadly. |
| 5 | Security | ☑ | Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding. |
| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). |
| 7 | Design-document adherence | ☑ | Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002). |
| 8 | Code organization & conventions | ☑ | Concrete-type downcast of `ISiteEventLogger` (-007); `internal Connection` leaks DB handle (-007). |
| 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
| 10 | Documentation & comments | ☑ | `LogEventAsync` XML doc says "asynchronously" but is synchronous (-009); stale "Phase 4+" placeholder (-011). |
## Findings
### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:100-102`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:36-55` |
**Description**
`PurgeByStorageCap` issues `PRAGMA incremental_vacuum` after each delete batch to
reclaim space, then re-measures the database size via `page_count * page_size`.
`incremental_vacuum` only has any effect when the database was created with
`auto_vacuum = INCREMENTAL`. `InitializeSchema` never sets `auto_vacuum`, so the
database uses the SQLite default (`auto_vacuum = NONE`). With `NONE`,
`incremental_vacuum` is silently ignored and `page_count` does not decrease when
rows are deleted (free pages are retained in the file). Consequently the
`while (currentSizeBytes > capBytes)` loop never observes the size dropping. The
storage-cap feature required by the design ("configurable maximum database size...
oldest events are purged first") is therefore non-functional — it cannot bring the
file back under the cap.
**Recommendation**
Set `PRAGMA auto_vacuum = INCREMENTAL` in `InitializeSchema` before any tables are
created (it must be set before table creation or followed by a full `VACUUM` to take
effect on an existing database). Alternatively, run a full `VACUUM` after cap-purge
deletes, or measure logical data size (e.g. `(page_count - freelist_count) * page_size`)
instead of relying on `incremental_vacuum`.
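A minimal sketch of the first and third options, assuming the `Microsoft.Data.Sqlite` provider (suggested by the `SqliteConnection` type name). The file name, the `site_events` table, and its columns are hypothetical stand-ins for the module's real schema, and the helper names are illustrative only.

```csharp
using System;
using Microsoft.Data.Sqlite;

// Hypothetical file, table, and column names; the module's real schema may differ.
using var conn = new SqliteConnection("Data Source=event-log-sketch.db");
conn.Open();

// Must run before the first table is created. On an existing database the
// setting only takes effect after a full VACUUM.
Exec(conn, "PRAGMA auto_vacuum = INCREMENTAL;");
Exec(conn, "CREATE TABLE IF NOT EXISTS site_events (id INTEGER PRIMARY KEY, timestamp TEXT NOT NULL, message TEXT);");

// Delete the oldest batch, then hand the freed pages back to the file system.
Exec(conn, "DELETE FROM site_events WHERE id IN (SELECT id FROM site_events ORDER BY id LIMIT 1000);");
Exec(conn, "PRAGMA incremental_vacuum;");

Console.WriteLine($"file bytes: {FileBytes(conn)}, used bytes: {UsedBytes(conn)}");

// Runs a statement and drains any rows it returns, so a pragma that yields
// rows while it works (as incremental_vacuum may) runs to completion.
static void Exec(SqliteConnection c, string sql)
{
    using var cmd = c.CreateCommand();
    cmd.CommandText = sql;
    using var reader = cmd.ExecuteReader();
    while (reader.Read()) { }
}

static long Scalar(SqliteConnection c, string sql)
{
    using var cmd = c.CreateCommand();
    cmd.CommandText = sql;
    return (long)cmd.ExecuteScalar()!;
}

// Size of the database file as SQLite sees it, free pages included.
static long FileBytes(SqliteConnection c) =>
    Scalar(c, "PRAGMA page_count;") * Scalar(c, "PRAGMA page_size;");

// Size of the pages that actually hold data.
static long UsedBytes(SqliteConnection c) =>
    (Scalar(c, "PRAGMA page_count;") - Scalar(c, "PRAGMA freelist_count;")) * Scalar(c, "PRAGMA page_size;");
```

Comparing `UsedBytes` against the cap would let the purge loop terminate even on a database created without `auto_vacuum`, although the file itself would still not shrink until a `VACUUM` runs.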
**Resolution**
_Unresolved._
### SiteEventLogging-002 — Storage-cap purge deletes the entire table when space is not reclaimed
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:87-105` |
**Description**
Because of SiteEventLogging-001 the on-disk size never shrinks after a delete batch,
so `currentSizeBytes` stays above `capBytes`. The loop then keeps deleting 1000-row
batches on every iteration until `ExecuteNonQuery` returns 0 — i.e. until the table
is completely empty. The design states the cap should purge "the oldest events...
first" to stay within budget, not wipe the whole log. When the cap is hit (e.g.
during an alarm storm) this destroys all retained diagnostic history rather than
trimming it to the budget. The unit test `PurgeByStorageCap_DeletesOldestWhenOverCap`
masks the problem because it uses `MaxStorageMb = 0`, which legitimately expects an
empty table, so the over-delete behaviour is never exercised against a realistic cap.
**Recommendation**
Fix the size measurement / vacuum (SiteEventLogging-001) so the loop terminates when
the file is genuinely under the cap. Add a guard so the loop stops once
`currentSizeBytes` has stopped decreasing across iterations, and add a test with a
non-zero cap and a known oversized dataset to assert that only the oldest events are
removed.
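The stall guard could look roughly like the sketch below. `CapPurge`, `measureBytes`, and `deleteOldestBatch` are hypothetical names; the two delegates stand in for the real size query and the batched oldest-first `DELETE`.

```csharp
using System;

public static class CapPurge
{
    // Deletes oldest-first batches until the measured size is within the cap,
    // no rows remain, or a batch fails to reduce the size (stall guard).
    // Returns the total number of rows deleted.
    public static int PurgeToCap(long capBytes, Func<long> measureBytes, Func<int> deleteOldestBatch)
    {
        var totalDeleted = 0;
        var size = measureBytes();

        while (size > capBytes)
        {
            var deleted = deleteOldestBatch();
            if (deleted == 0)
            {
                break; // nothing left to delete
            }
            totalDeleted += deleted;

            var newSize = measureBytes();
            if (newSize >= size)
            {
                break; // size is not shrinking: stop rather than wipe the log
            }
            size = newSize;
        }

        return totalDeleted;
    }
}
```

With a size measurement that never changes, this version removes at most one batch per purge cycle instead of emptying the table, which bounds the damage while the underlying measurement is being fixed.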
**Resolution**
_Unresolved._
### SiteEventLogging-003 — Shared `SqliteConnection` used by purge and query without the write lock
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:64,90,100,110,114`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:36`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34,72` |
**Description**
`SiteEventLogger` owns a single `SqliteConnection` and serialises its own writes via
`lock (_writeLock)`. `EventLogPurgeService` and `EventLogQueryService` both reach
into `_eventLogger.Connection` and execute commands directly, without acquiring
`_writeLock`. The purge runs on a `BackgroundService` thread (a different thread from
event-recording callers and from the actor that drives the query service). A single
`SqliteConnection` / `SqliteCommand` is not thread-safe; concurrent use from the
purge thread and a recording thread (or query thread) can throw
`SqliteException`/`InvalidOperationException` ("DataReader already open",
"connection busy") or corrupt command state. The purge `DELETE` and the recorder
`INSERT` racing is the most likely collision because event recording is continuous.
**Recommendation**
Funnel all access to the connection through a single synchronisation point: either
expose lock-guarded methods on `SiteEventLogger` for purge/query to call, or give the
purge and query services their own dedicated `SqliteConnection` instances (SQLite
supports multiple connections to the same file; `journal_mode = WAL` plus a
`busy_timeout` makes this safer, whereas SQLite's own documentation discourages
shared-cache mode). Do not share one `SqliteConnection` across threads.
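A sketch of the single-synchronisation-point option, assuming `Microsoft.Data.Sqlite`. `GuardedEventLogConnection` and `Use` are hypothetical names, not existing members of the module.

```csharp
using System;
using Microsoft.Data.Sqlite;

// Hypothetical wrapper: the only type that ever touches the connection.
public sealed class GuardedEventLogConnection : IDisposable
{
    private readonly object _gate = new();
    private readonly SqliteConnection _connection;

    public GuardedEventLogConnection(string connectionString)
    {
        _connection = new SqliteConnection(connectionString);
        _connection.Open();
    }

    // Runs the callback while holding the lock, so writes, queries, and purges
    // never use the connection concurrently. The callback must finish with the
    // connection (dispose commands and readers) before it returns.
    public T Use<T>(Func<SqliteConnection, T> work)
    {
        lock (_gate)
        {
            return work(_connection);
        }
    }

    public void Dispose()
    {
        lock (_gate)
        {
            _connection.Dispose();
        }
    }
}
```

A purge would then call something like `guarded.Use(c => { using var cmd = c.CreateCommand(); cmd.CommandText = sql; return cmd.ExecuteNonQuery(); })`. The trade-off is that a long purge batch blocks writers for its duration, which is one reason dedicated connections may be the better fit.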
**Resolution**
_Unresolved._
### SiteEventLogging-004 — Event-log handler runs as a cluster singleton that can land on the standby node
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:313-336`, `src/ScadaLink.SiteEventLogging/EventLogHandlerActor.cs:21-25` |
**Description**
`EventLogHandlerActor` is hosted as a `ClusterSingletonManager` singleton with the
stated intent that "queries always reach the active node". However, an Akka.NET
cluster singleton is pinned to the *oldest* member of the role, which is not the
same concept as the SCADA "active node" (the node currently running the Deployment
Manager singleton / serving live traffic). The design doc is explicit: "Only the
active node generates and stores events... the new active node starts logging to its
own SQLite database." The event-log SQLite file is node-local and unreplicated.
Nothing guarantees the event-log singleton co-locates with the active node, so a
remote query can be served by the standby node and read that node's near-empty
database, returning no events even though the active node has a full log. The
explanatory comment in `AkkaHostedService.cs` asserts the opposite of what actually
happens.
**Recommendation**
Either (a) host the query handler as a normal per-node actor and route queries to
the active node explicitly (the node owning the Deployment Manager singleton), or
(b) make the event-log writer follow the same singleton so the writer and the query
handler are guaranteed co-located. Reconcile the design doc and the inline comment
with whichever model is chosen.
**Resolution**
_Unresolved._
### SiteEventLogging-005 — `LogEventAsync` performs synchronous disk I/O on the caller's thread
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57-99` |
**Description**
`LogEventAsync` has an asynchronous-looking signature (it returns `Task` and carries
the `Async` suffix) but its body is entirely synchronous: it takes `lock (_writeLock)`, runs
`cmd.ExecuteNonQuery()` (a blocking SQLite write), then returns `Task.CompletedTask`.
Callers across the codebase invoke it fire-and-forget as `_ = LogEventAsync(...)`
(e.g. `ScriptExecutionActor.cs:133`, `DataConnectionActor.cs:292`,
`ScriptActor.cs:250`) expecting it to be non-blocking. In reality the SQLite write,
and any contention on `_writeLock`, executes inline on the Akka actor thread of the
calling subsystem. Under an event burst (alarm storm, script failure loop) this
serialises actor threads on disk I/O and the global write lock, degrading the
hot-path subsystems the design intends to keep responsive.
**Recommendation**
Either make recording genuinely asynchronous (offload to a dedicated single-threaded
writer / `Channel<T>` consumer so callers truly fire-and-forget), or rename the
method to `LogEvent` and document that it blocks, so callers can decide. Given the
design's emphasis on not impacting runtime subsystems, an internal queue with a
background flush is preferable.
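One possible shape for the internal queue, sketched with `System.Threading.Channels`. `PendingEvent` and `QueuedEventWriter` are hypothetical, the `persist` delegate stands in for the blocking SQLite `INSERT`, and the capacity and full-queue behaviour are assumptions that would need to match the design's durability expectations.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical event shape; the module's real record has more fields.
public sealed record PendingEvent(DateTimeOffset Timestamp, string EventType, string Message);

public sealed class QueuedEventWriter : IAsyncDisposable
{
    private readonly Channel<PendingEvent> _channel;
    private readonly Task _consumer;

    // persist is a stand-in for the blocking database write.
    public QueuedEventWriter(Action<PendingEvent> persist, int capacity = 10_000)
    {
        _channel = Channel.CreateBounded<PendingEvent>(new BoundedChannelOptions(capacity)
        {
            SingleReader = true,
            FullMode = BoundedChannelFullMode.Wait // TryWrite returns false when the queue is full
        });

        // Single consumer: all database writes happen on this one background task.
        _consumer = Task.Run(async () =>
        {
            await foreach (var evt in _channel.Reader.ReadAllAsync())
            {
                try
                {
                    persist(evt);
                }
                catch (Exception)
                {
                    // A real implementation would count or log the failure here;
                    // the consumer keeps running either way.
                }
            }
        });
    }

    // Never blocks the caller. Returns false if the queue is full or closed,
    // so the caller can count the dropped event.
    public bool Enqueue(PendingEvent evt) => _channel.Writer.TryWrite(evt);

    public async ValueTask DisposeAsync()
    {
        _channel.Writer.TryComplete();
        await _consumer; // drain queued events before shutdown
    }
}
```

Actor code would call `Enqueue` and return immediately; only the consumer task touches the disk, which also removes contention on the write lock from the hot path.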
**Resolution**
_Unresolved._
### SiteEventLogging-006 — Missing indexes for severity and keyword-search query paths
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:50-52`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:65-81` |
**Description**
`InitializeSchema` creates indexes on `timestamp`, `event_type`, and `instance_id`.
The query service also filters on `severity` (`severity = $severity`) and performs
`message LIKE '%...%'` / `source LIKE '%...%'` keyword search. `severity` has no
index, and a leading-wildcard `LIKE` cannot use a normal index at all. With up to a
1 GB database and a 500-row page size, severity-filtered and keyword queries do full
table scans on every page. The design explicitly lists keyword search as a supported,
expected query type.
**Recommendation**
Add an index on `severity` (or a composite index aligned with common filter
combinations such as `(event_type, severity, id)`). For keyword search, consider an
FTS5 virtual table over `message` and `source`, or accept the scan but document the
cost.
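To check whether a candidate index is actually used, `EXPLAIN QUERY PLAN` can be run against a scratch database. The sketch below assumes `Microsoft.Data.Sqlite`; the table, column, and index names are hypothetical, modelled on the filters named in this finding, and the query is only an approximation of what the query service issues.

```csharp
using System;
using Microsoft.Data.Sqlite;

// Hypothetical schema; the module's real table and column names may differ.
using var conn = new SqliteConnection("Data Source=:memory:");
conn.Open();

Exec("CREATE TABLE site_events (id INTEGER PRIMARY KEY, timestamp TEXT, event_type TEXT, " +
     "severity TEXT, instance_id TEXT, source TEXT, message TEXT);");

// Composite index: equality columns first, then id for keyset pagination.
Exec("CREATE INDEX IF NOT EXISTS idx_site_events_type_severity_id " +
     "ON site_events (event_type, severity, id);");

using var plan = conn.CreateCommand();
plan.CommandText =
    "EXPLAIN QUERY PLAN " +
    "SELECT id, message FROM site_events " +
    "WHERE event_type = $type AND severity = $severity AND id > $afterId " +
    "ORDER BY id LIMIT 500;";
plan.Parameters.AddWithValue("$type", "Alarm");
plan.Parameters.AddWithValue("$severity", "Error");
plan.Parameters.AddWithValue("$afterId", 0L);

using var reader = plan.ExecuteReader();
while (reader.Read())
{
    // The last column ("detail") says whether SQLite searches an index or scans the table.
    Console.WriteLine(reader.GetString(reader.FieldCount - 1));
}

void Exec(string sql)
{
    using var cmd = conn.CreateCommand();
    cmd.CommandText = sql;
    cmd.ExecuteNonQuery();
}
```

Running the same plan for a `message LIKE '%...%'` filter should show a table scan regardless of indexes, which is the case the FTS5 suggestion addresses.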
**Resolution**
_Unresolved._
### SiteEventLogging-007 — `ISiteEventLogger` consumers downcast to the concrete type and reach into the DB connection
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:25`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:26`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34` |
**Description**
Both `EventLogPurgeService` and `EventLogQueryService` take `ISiteEventLogger` via
DI and immediately downcast it: `_eventLogger = (SiteEventLogger)eventLogger;`. They
then access the `internal SqliteConnection Connection` property to run arbitrary SQL.
This defeats the purpose of the interface abstraction, makes the registration
fragile (any `ISiteEventLogger` that is not exactly `SiteEventLogger` causes an
`InvalidCastException` at construction), and leaks the database handle and raw SQL
surface out of the recorder. It is also the root cause of the unsynchronised
connection sharing in SiteEventLogging-003.
**Recommendation**
Introduce a proper data-access abstraction (e.g. an `IEventLogStore` with
`Insert`, `Query`, `PurgeOlderThan`, `PurgeToSize`, `GetSizeBytes`) that owns the
connection and its locking, and inject that into the recorder, query, and purge
services. Remove the `internal Connection` property and the concrete-type downcasts.
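A rough outline of such an abstraction is sketched below. Only the interface and member names come from the recommendation; `EventLogEntry`, `EventLogFilter`, and every parameter and return type are hypothetical.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical record and filter types, for illustration only.
public sealed record EventLogEntry(
    long Id, DateTimeOffset Timestamp, string EventType, string Severity, string Message);

public sealed record EventLogFilter(
    string? EventType = null, string? Severity = null, long? AfterId = null, int PageSize = 500);

// Owns the connection and its locking; callers never see SqliteConnection.
public interface IEventLogStore
{
    void Insert(EventLogEntry entry);

    IReadOnlyList<EventLogEntry> Query(EventLogFilter filter);

    // Both purge methods return the number of rows deleted.
    int PurgeOlderThan(DateTimeOffset cutoff);
    int PurgeToSize(long maxBytes);

    long GetSizeBytes();
}
```

With this in place the recorder, query service, and purge service each depend on `IEventLogStore`, so a test double can replace the SQLite implementation without any downcast.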
**Resolution**
_Unresolved._
### SiteEventLogging-008 — Event-recording write failures are silently swallowed
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:92-95` |
**Description**
If `ExecuteNonQuery` throws (disk full, database locked, file corruption), the
exception is caught, written to `ILogger`, and discarded; `LogEventAsync` still
returns `Task.CompletedTask` as if successful. Callers fire-and-forget the result so
they cannot detect failure. The event log is the site's diagnostic audit trail; a
sustained write failure (for example a locked-database storm caused by the
unsynchronised purge in SiteEventLogging-003) means events vanish with no signal to
operators except a local log line that nobody is watching. There is no failure
counter, no health-metric hook, and no retry.
**Recommendation**
Expose a failure signal: increment a counter that the Health Monitoring component
can surface (the design notes script/alarm error rates are derived from the event
log — a logging outage should be visible). At minimum, escalate repeated failures to
a Warning/Error health metric rather than only a local log line.
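A failure signal of this kind could be as small as the sketch below. `EventLogWriteHealth` and the escalation threshold are hypothetical; how the counters reach the Health Monitoring component is left open.

```csharp
using System;
using System.Threading;

// Hypothetical thread-safe counters for event-log write outcomes.
public sealed class EventLogWriteHealth
{
    private long _totalFailures;
    private int _consecutiveFailures;

    public long TotalFailures => Interlocked.Read(ref _totalFailures);
    public int ConsecutiveFailures => Volatile.Read(ref _consecutiveFailures);

    // Call after a successful write to clear the failure streak.
    public void RecordSuccess() => Interlocked.Exchange(ref _consecutiveFailures, 0);

    // Call from the catch block. Returns true exactly when the streak reaches
    // the threshold, so the caller escalates once per outage rather than on
    // every failed write.
    public bool RecordFailure(int escalateAfter = 5)
    {
        Interlocked.Increment(ref _totalFailures);
        return Interlocked.Increment(ref _consecutiveFailures) == escalateAfter;
    }
}
```

The existing catch block would call `RecordFailure()` and raise a health metric when it returns `true`, while the success path calls `RecordSuccess()`.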
**Resolution**
_Unresolved._
### SiteEventLogging-009 — XML doc on `LogEventAsync` claims asynchronous behaviour
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:8-10`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57` |
**Description**
The interface XML doc states "Record an event asynchronously." and the method is
named `LogEventAsync`, but the implementation is fully synchronous (see
SiteEventLogging-005). The documentation and naming are misleading: a reader will
reasonably assume the write is offloaded and the caller's thread is not blocked,
which is false. The `details` parameter doc says "Optional JSON details" but nothing
validates or requires JSON, so callers may pass arbitrary text.
**Recommendation**
Align the name, signature, and documentation with the actual behaviour — either make
the method genuinely asynchronous or rename to `LogEvent` and correct the doc.
Clarify that `details` is free-form text unless JSON is actually enforced.
**Resolution**
_Unresolved._
### SiteEventLogging-010 — Test coverage gaps: actor bridge, purge/write concurrency, vacuum effectiveness, query error path
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.SiteEventLogging.Tests/` |
**Description**
The test suite covers recording, query filtering/pagination, and basic purge, but
several critical behaviours are untested:
- `EventLogHandlerActor` has no test — the actor message contract
(`EventLogQueryRequest` -> `EventLogQueryResponse`, `Sender.Tell`) is unverified.
- No test exercises purge running concurrently with active writes/queries, so the
connection-sharing race (SiteEventLogging-003) is invisible to CI.
- `PurgeByStorageCap` is only tested with `MaxStorageMb = 0`, which hides the
no-op-vacuum / over-delete bug (SiteEventLogging-001, -002). No test asserts the
file shrinks or that only oldest events are removed under a realistic cap.
- `EventLogQueryService.ExecuteQuery`'s catch block (`Success: false`,
`ErrorMessage`) has no test.
- `SiteEventLogger.Dispose` semantics (logging after dispose returns
`Task.CompletedTask`) and re-entrant dispose are untested.
**Recommendation**
Add tests for the actor bridge, a concurrency stress test (purge + write + query in
parallel), a realistic non-zero-cap purge test asserting size reduction and
oldest-first deletion, and a query-error-path test (e.g. corrupt/closed connection).
**Resolution**
_Unresolved._
### SiteEventLogging-011 — Stale "Phase 4+" placeholder in `ServiceCollectionExtensions`
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:18-22` |
**Description**
`AddSiteEventLoggingActors` is an empty method with a comment "Placeholder for Akka
actor registration (Phase 4+)". The actor (`EventLogHandlerActor`) is in fact already
implemented and is registered directly in `AkkaHostedService.cs:313-336`, not through
this method. The placeholder is dead code: it is either never called or called with
no effect, and the comment is stale. A reader looking for where the event-log actor
is wired up will be misdirected.
**Recommendation**
Either implement the actor registration here and have `AkkaHostedService` call it
(centralising the wiring), or delete `AddSiteEventLoggingActors` entirely and remove
the misleading comment.
**Resolution**
_Unresolved._