fix(site-event-logging): resolve SiteEventLogging-001/002/003, re-triage 004 — incremental auto_vacuum, cap-purge guard, write-lock connection access

2026-05-16 19:47:17 -04:00
parent 0d9363766d
commit 0529cf2d40
6 changed files with 411 additions and 99 deletions
--- a/code-reviews/SiteEventLogging/findings.md
+++ b/code-reviews/SiteEventLogging/findings.md
@@ -8,7 +8,7 @@
 | Last reviewed | 2026-05-16 |
 | Reviewer | claude-agent |
 | Commit reviewed | `9c60592` |
-| Open findings | 11 |
+| Open findings | 7 |

 ## Summary

@@ -51,7 +51,7 @@ the requirement strictly, and several documentation/maintainability issues.
 |--|--|
 | Severity | High |
 | Category | Correctness & logic bugs |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:100-102`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:36-55` |

 **Description**
@@ -78,7 +78,12 @@ deletes, or measure logical data size (e.g. `page_count - freelist_count` times

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit `<pending>`): `InitializeSchema` now sets
+`PRAGMA auto_vacuum = INCREMENTAL` before any table is created, and
+`GetDatabaseSizeBytes` measures logical size as `(page_count - freelist_count) *
+page_size`, so free-list pages no longer mask the size drop. The cap-purge loop now
+reliably observes the database shrinking. Regression test
+`PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable`.

 ### SiteEventLogging-002 — Storage-cap purge deletes the entire table when space is not reclaimed

@@ -86,7 +91,7 @@ _Unresolved._
 |--|--|
 | Severity | High |
 | Category | Design-document adherence |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:87-105` |

 **Description**
@@ -111,7 +116,14 @@ removed.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit `<pending>`): with the size measurement fixed
+(SiteEventLogging-001) the cap loop terminates when the file is genuinely under the
+cap. An additional guard stops the loop if the measured size fails to decrease across
+an iteration, so a cap that can never be met no longer empties the whole table.
+Regression tests `PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable` (asserts the
+table is not emptied and the file ends under a realistic non-zero cap) and
+`PurgeByStorageCap_RemovesOldestEventsFirst` (asserts only the oldest events are
+removed).

 ### SiteEventLogging-003 — Shared `SqliteConnection` used by purge and query without the write lock

@@ -119,7 +131,7 @@ _Unresolved._
 |--|--|
 | Severity | High |
 | Category | Concurrency & thread safety |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:64,90,100,110,114`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:36`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34,72` |

 **Description**
@@ -145,15 +157,24 @@ makes this safer). Do not share one `SqliteConnection` across threads.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit `<pending>`): the raw `internal Connection` property was
+removed from `SiteEventLogger` and replaced with lock-guarded `WithConnection(...)`
+overloads that hold the existing `_writeLock` for the duration of the caller's
+delegate. `EventLogPurgeService`, `EventLogQueryService`, and `LogEventAsync` all now
+access the connection exclusively through `WithConnection`, so the purge thread,
+query thread, and recording threads are serialised on a single lock. `Dispose` was
+also brought under the lock to avoid a dispose/use race. Regression test
+`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` exercises purge running
+concurrently with multiple writer threads.

 ### SiteEventLogging-004 — Event-log handler runs as a cluster singleton that can land on the standby node

 | | |
 |--|--|
-| Severity | High |
+| Severity | Low |
+| Original severity | High (re-triaged down to Low on 2026-05-16 — see Re-triage note) |
 | Category | Design-document adherence |
-| Status | Open |
+| Status | Won't Fix |
 | Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:313-336`, `src/ScadaLink.SiteEventLogging/EventLogHandlerActor.cs:21-25` |

 **Description**
@@ -179,9 +200,40 @@ the active node explicitly (the node owning the Deployment Manager singleton), o
 handler are guaranteed co-located. Reconcile the design doc and the inline comment
 with whichever model is chosen.

+**Re-triage note (2026-05-16)**
+
+The finding's central claim — that a remote query "can be served by the standby
+node and read that node's near-empty database" — is incorrect for the query path.
+In `AkkaHostedService.cs` the `event-log-handler` `ClusterSingletonManager` and the
+`deployment-manager` `ClusterSingletonManager` are created with the **same role**
+(`siteRole`) in the **same cluster**. Akka.NET pins every cluster singleton of a
+given role to the *oldest member of that role* — so all same-role singletons in a
+cluster co-locate on one node. The "active node" in this codebase is, by definition,
+the node hosting the `deployment-manager` singleton; the event-log query singleton
+is therefore *guaranteed* to run on that same node. A `ClusterClient` query cannot
+land on the standby. The inline comment in `AkkaHostedService.cs` is accurate, not
+"the opposite of what happens".
+
+A real but distinct concern exists: the *writer* (`SiteEventLogger`) is registered
+as a plain per-node DI singleton (`AddSiteEventLogging`), so it records to a local
+SQLite file on **every** node, including the standby. That wastes storage on the
+standby but does **not** cause the query-returns-nothing symptom the finding
+describes, because the query singleton always reads the *active* node's (populated)
+database. Gating the writer to the active node would be a `ScadaLink.Host` wiring
+change, outside this module's scope, and is a minor optimisation rather than a
+correctness defect.
+
+Re-triaged from High to Low and closed as **Won't Fix**: the High-severity
+correctness claim does not hold. The only residual cleanup is gating the standby-node
+writer (the inline comment needs no change); it can be raised as a fresh Low finding
+against `ScadaLink.Host` if desired.
+
 **Resolution**

-_Unresolved._
+Won't Fix — 2026-05-16 (commit `<pending>`). Re-triaged: the asserted defect (query
+served by standby returning an empty log) cannot occur because the event-log query
+singleton and the deployment-manager singleton share a role and so always co-locate
+on the active node. No code change made; see the re-triage note above.

 ### SiteEventLogging-005 — `LogEventAsync` performs synchronous disk I/O on the caller's thread