fix(site-event-logging): resolve SiteEventLogging-001/002/003, re-triage 004 — incremental auto_vacuum, cap-purge guard, write-lock connection access
@@ -8,7 +8,7 @@
 | Last reviewed | 2026-05-16 |
 | Reviewer | claude-agent |
 | Commit reviewed | `9c60592` |
-| Open findings | 11 |
+| Open findings | 7 |

 ## Summary

@@ -51,7 +51,7 @@ the requirement strictly, and several documentation/maintainability issues.
 |--|--|
 | Severity | High |
 | Category | Correctness & logic bugs |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:100-102`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:36-55` |

 **Description**
@@ -78,7 +78,12 @@ deletes, or measure logical data size (e.g. `page_count - freelist_count` times

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit `<pending>`): `InitializeSchema` now sets
+`PRAGMA auto_vacuum = INCREMENTAL` before any table is created, and
+`GetDatabaseSizeBytes` measures logical size as `(page_count - freelist_count) *
+page_size` so reclaimed pages no longer mask the size drop. The cap-purge loop now
+reliably observes the database shrinking. Regression test
+`PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable`.
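
The logical-size measurement described in the SiteEventLogging-001 resolution can be illustrated outside the C# codebase. The following is a minimal Python sketch using the standard-library `sqlite3` module. It is not the project's `InitializeSchema` or `GetDatabaseSizeBytes`; the `events` table, the payload size, and the helper names `logical_size_bytes` and `demo` are assumptions made for the demonstration.

```python
import os
import sqlite3
import tempfile


def logical_size_bytes(conn):
    """Bytes held by pages still in use: (page_count - freelist_count) * page_size."""
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    freelist_count = conn.execute("PRAGMA freelist_count").fetchone()[0]
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    return (page_count - freelist_count) * page_size


def demo():
    """Return (auto_vacuum mode, logical size before delete, logical size after)."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "events.db"))
        try:
            # The pragma has to run before the first table is created,
            # otherwise it only takes effect after a full VACUUM.
            conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
            conn.execute(
                "CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)"
            )
            # Hypothetical workload: 2000 rows of roughly 500 bytes each.
            conn.executemany(
                "INSERT INTO events (payload) VALUES (?)", [("x" * 500,)] * 2000
            )
            conn.commit()

            mode = conn.execute("PRAGMA auto_vacuum").fetchone()[0]
            before = logical_size_bytes(conn)

            # Deleting the oldest half moves whole pages onto the freelist.
            conn.execute("DELETE FROM events WHERE id <= 1000")
            conn.commit()
            after = logical_size_bytes(conn)
            return mode, before, after
        finally:
            conn.close()


if __name__ == "__main__":
    mode, before, after = demo()
    print(f"auto_vacuum mode={mode}, logical bytes before={before}, after={after}")
```

SQLite reports mode `2` for incremental auto-vacuum. Freed pages stay in the file until `PRAGMA incremental_vacuum` is run, so a raw file-size check would not see the drop that the logical measure reports.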

 ### SiteEventLogging-002 — Storage-cap purge deletes the entire table when space is not reclaimed

@@ -86,7 +91,7 @@ _Unresolved._
 |--|--|
 | Severity | High |
 | Category | Design-document adherence |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:87-105` |

 **Description**
@@ -111,7 +116,14 @@ removed.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit `<pending>`): with the size measurement fixed
+(SiteEventLogging-001) the cap loop terminates when the file is genuinely under the
+cap. An additional guard stops the loop if the on-disk size fails to decrease across
+an iteration, so a cap that can never be met no longer empties the whole table.
+Regression tests `PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable` (asserts the
+table is not emptied and the file ends under a realistic non-zero cap) and
+`PurgeByStorageCap_RemovesOldestEventsFirst` (asserts only the oldest events are
+removed).
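
The loop-plus-guard behaviour described in the SiteEventLogging-002 resolution can be sketched as follows. This is an illustrative Python version built on the standard-library `sqlite3` module, not the code in `EventLogPurgeService.cs`. The `events` table, the batch size of 100, the pluggable `measure` parameter, and every function name are assumptions introduced for the example.

```python
import os
import sqlite3
import tempfile


def logical_size_bytes(conn):
    """Bytes held by pages still in use: (page_count - freelist_count) * page_size."""
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    freelist_count = conn.execute("PRAGMA freelist_count").fetchone()[0]
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    return (page_count - freelist_count) * page_size


def purge_to_cap(conn, cap_bytes, measure=logical_size_bytes, batch_size=100):
    """Delete the oldest rows in batches until measure(conn) <= cap_bytes.

    Stops early when a batch deletes nothing or the measured size does not
    decrease, so an unreachable cap cannot empty the table. Returns the
    number of rows deleted.
    """
    deleted = 0
    size = measure(conn)
    while size > cap_bytes:
        cursor = conn.execute(
            "DELETE FROM events WHERE id IN "
            "(SELECT id FROM events ORDER BY id LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        deleted += cursor.rowcount
        new_size = measure(conn)
        if cursor.rowcount == 0 or new_size >= size:
            break  # no progress: give up instead of deleting further
        size = new_size
    return deleted


def make_db(path, rows=2000):
    """Create a hypothetical event table with ~500-byte rows."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO events (payload) VALUES (?)", [("x" * 500,)] * rows
    )
    conn.commit()
    return conn


def demo():
    """Return ((remaining rows, oldest remaining id), under_cap, rows deleted when stuck)."""
    with tempfile.TemporaryDirectory() as tmp:
        # Case 1: a reachable cap set to half of the current logical size.
        conn = make_db(os.path.join(tmp, "reachable.db"))
        try:
            cap = logical_size_bytes(conn) // 2
            purge_to_cap(conn, cap)
            remaining = conn.execute(
                "SELECT COUNT(*), MIN(id) FROM events"
            ).fetchone()
            under_cap = logical_size_bytes(conn) <= cap
        finally:
            conn.close()

        # Case 2: a measure that never shrinks, standing in for a raw file
        # size that is not reclaimed. The guard stops after a single batch.
        conn = make_db(os.path.join(tmp, "stuck.db"))
        try:
            deleted_when_stuck = purge_to_cap(conn, 0, measure=lambda c: 10**9)
        finally:
            conn.close()
    return remaining, under_cap, deleted_when_stuck
```

With a constant measure the sketch removes one batch and stops, which mirrors the "cap that can never be met" case. Whether the real service checks progress before or after the first delete is not stated in the resolution text.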

 ### SiteEventLogging-003 — Shared `SqliteConnection` used by purge and query without the write lock

@@ -119,7 +131,7 @@ _Unresolved._
 |--|--|
 | Severity | High |
 | Category | Concurrency & thread safety |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:64,90,100,110,114`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:36`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34,72` |

 **Description**
@@ -145,15 +157,24 @@ makes this safer). Do not share one `SqliteConnection` across threads.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit `<pending>`): the raw `internal Connection` property was
+removed from `SiteEventLogger` and replaced with lock-guarded `WithConnection(...)`
+overloads that hold the existing `_writeLock` for the duration of the caller's
+delegate. `EventLogPurgeService`, `EventLogQueryService`, and `LogEventAsync` all now
+access the connection exclusively through `WithConnection`, so the purge thread,
+query thread, and recording threads are serialised on a single lock. `Dispose` was
+also brought under the lock to avoid a dispose/use race. Regression test
+`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` exercises purge running
+concurrently with multiple writer threads.
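
The lock-guarded access pattern named in the SiteEventLogging-003 resolution can be shown in miniature. The sketch below is a Python analogue, not the C# `WithConnection(...)` overloads themselves. The class name `GuardedConnection`, the `events` table, and the thread counts are all hypothetical.

```python
import sqlite3
import threading


class GuardedConnection:
    """Owns one SQLite connection and only exposes it while holding a lock."""

    def __init__(self, path=":memory:"):
        self._lock = threading.Lock()
        # check_same_thread=False lets several threads use the connection;
        # the lock is what actually keeps those uses from overlapping.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._closed = False

    def with_connection(self, action):
        """Run action(conn) under the lock and return its result."""
        with self._lock:
            if self._closed:
                raise RuntimeError("connection has been closed")
            return action(self._conn)

    def close(self):
        """Close under the same lock so a close cannot race an in-flight use."""
        with self._lock:
            if not self._closed:
                self._conn.close()
                self._closed = True


def demo(writers=4, events_per_writer=50):
    """Run several writer threads through one guarded connection."""
    guarded = GuardedConnection()
    guarded.with_connection(
        lambda conn: conn.execute(
            "CREATE TABLE events (id INTEGER PRIMARY KEY, source TEXT)"
        )
    )

    def insert(conn, source):
        conn.execute("INSERT INTO events (source) VALUES (?)", (source,))
        conn.commit()

    def write(source):
        for _ in range(events_per_writer):
            guarded.with_connection(lambda conn: insert(conn, source))

    threads = [
        threading.Thread(target=write, args=(f"writer-{i}",))
        for i in range(writers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    count = guarded.with_connection(
        lambda conn: conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    )
    guarded.close()
    return guarded, count
```

Because `threading.Lock` is not re-entrant, a delegate in this sketch must not call `with_connection` again. Holding the lock for the whole delegate also means a slow query blocks writers, which is the trade-off a single serialising lock makes.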

 ### SiteEventLogging-004 — Event-log handler runs as a cluster singleton that can land on the standby node

 | | |
 |--|--|
-| Severity | High |
+| Severity | Low |
+| Original severity | High (re-triaged down to Low on 2026-05-16 — see Re-triage note) |
 | Category | Design-document adherence |
-| Status | Open |
+| Status | Won't Fix |
 | Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:313-336`, `src/ScadaLink.SiteEventLogging/EventLogHandlerActor.cs:21-25` |

 **Description**
@@ -179,9 +200,40 @@ the active node explicitly (the node owning the Deployment Manager singleton), o
 handler are guaranteed co-located. Reconcile the design doc and the inline comment
 with whichever model is chosen.

+**Re-triage note (2026-05-16)**
+
+The finding's central claim — that a remote query "can be served by the standby
+node and read that node's near-empty database" — is incorrect for the query path.
+In `AkkaHostedService.cs` the `event-log-handler` `ClusterSingletonManager` and the
+`deployment-manager` `ClusterSingletonManager` are created with the **same role**
+(`siteRole`) in the **same cluster**. Akka.NET pins every cluster singleton of a
+given role to the *oldest member of that role* — so all same-role singletons in a
+cluster co-locate on one node. The "active node" in this codebase is, by definition,
+the node hosting the `deployment-manager` singleton; the event-log query singleton
+is therefore *guaranteed* to run on that same node. A `ClusterClient` query cannot
+land on the standby. The inline comment in `AkkaHostedService.cs` is accurate, not
+"the opposite of what happens".
+
+A real but distinct concern exists: the *writer* (`SiteEventLogger`) is registered
+as a plain per-node DI singleton (`AddSiteEventLogging`), so it records to a local
+SQLite file on **every** node, including the standby. That wastes storage on the
+standby but does **not** cause the query-returns-nothing symptom the finding
+describes, because the query singleton always reads the *active* node's (populated)
+database. Gating the writer to the active node would be a `ScadaLink.Host` wiring
+change, outside this module's scope, and is a minor optimisation rather than a
+correctness defect.
+
+Re-triaged from High to Low and closed as **Won't Fix**: the High-severity
+correctness claim does not hold. Any residual cleanup (gate the standby-node writer;
+the comment needs no change) can be raised as a fresh Low finding against
+`ScadaLink.Host` if desired.
+
 **Resolution**

-_Unresolved._
+Won't Fix — 2026-05-16 (commit `<pending>`). Re-triaged: the asserted defect (query
+served by standby returning an empty log) cannot occur because the event-log query
+singleton and the deployment-manager singleton share a role and so always co-locate
+on the active node. No code change made; see the re-triage note above.

 ### SiteEventLogging-005 — `LogEventAsync` performs synchronous disk I/O on the caller's thread
