fix(site-event-logging): resolve SiteEventLogging-005,007,008,010 — background async writer, drop concrete downcast, surface write failures, test coverage

This commit is contained in:
Joseph Doherty
2026-05-16 21:44:10 -04:00
parent 632d44f38c
commit 24a4a2d165
10 changed files with 559 additions and 53 deletions

View File

@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 7 |
| Open findings | 3 |
## Summary
@@ -241,7 +241,7 @@ on the active node. No code change made; see the re-triage note above.
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57-99` |
**Description**
@@ -267,7 +267,16 @@ background flush is preferable.
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit `pending`): event recording is now offloaded to a
dedicated background writer. `SiteEventLogger` owns an unbounded `Channel<T>` and a
single background consumer thread; `LogEventAsync` only validates its arguments and
enqueues, so caller threads (Akka actor threads on hot paths) never block on the
SQLite write or on contention for the write lock. The returned `Task` completes once
the event is durably persisted (so `await` callers still observe write ordering) and
faults if the write fails. `Dispose` completes the channel and drains the writer.
Regression test `LogEventAsync_DoesNotBlockCaller_WhenWriteIsSlow` (verifies the
caller returns in <500 ms while the database is held busy) plus
`LogEventAsync_TaskCompletes_AfterEventIsPersisted`.
### SiteEventLogging-006 — Missing indexes for severity and keyword-search query paths
@@ -299,36 +308,51 @@ cost.
_Unresolved._
### SiteEventLogging-007 — `ISiteEventLogger` consumers downcast to the concrete type and reach into the DB connection
### SiteEventLogging-007 — `ISiteEventLogger` consumers downcast to the concrete type
| | |
|--|--|
| Severity | Medium |
| Severity | Medium (partially re-triaged 2026-05-16 — see Re-triage note) |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:25`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:26`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34` |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:21-30`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:20-28`, `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:10-23` |
**Description**
Both `EventLogPurgeService` and `EventLogQueryService` take `ISiteEventLogger` via
DI and immediately downcast it: `_eventLogger = (SiteEventLogger)eventLogger;`. They
then access the `internal SqliteConnection Connection` property to run arbitrary SQL.
This defeats the purpose of the interface abstraction, makes the registration
fragile (any `ISiteEventLogger` that is not exactly `SiteEventLogger` causes an
`InvalidCastException` at construction), and leaks the database handle and raw SQL
surface out of the recorder. It is also the root cause of the unsynchronised
connection sharing in SiteEventLogging-003.
Both `EventLogPurgeService` and `EventLogQueryService` took `ISiteEventLogger` via
DI and immediately downcast it: `_eventLogger = (SiteEventLogger)eventLogger;`. This
made the registration fragile — any `ISiteEventLogger` that is not exactly
`SiteEventLogger` (a test double, a decorator) caused an `InvalidCastException` at
construction — and defeated the purpose of the interface abstraction.
**Re-triage note (2026-05-16)**
The finding as originally written also claimed the services "access the `internal
SqliteConnection Connection` property to run arbitrary SQL" and called itself "the
root cause of the unsynchronised connection sharing in SiteEventLogging-003". That
part is stale: the resolution of SiteEventLogging-003 had already removed the
`internal Connection` property and replaced it with lock-guarded `WithConnection`
overloads. At the time this finding was actioned, the only remaining defect was the
concrete-type downcast itself. Severity stays Medium; the description is corrected to
the downcast-only scope.
**Recommendation**
Introduce a proper data-access abstraction (e.g. an `IEventLogStore` with
`Insert`, `Query`, `PurgeOlderThan`, `PurgeToSize`, `GetSizeBytes`) that owns the
connection and its locking, and inject that into the recorder, query, and purge
services. Remove the `internal Connection` property and the concrete-type downcasts.
Have the purge and query services depend on the concrete `SiteEventLogger` directly
(it is the type that owns the lock-guarded `WithConnection`), and register the
concrete type in DI with the interface forwarded to the same singleton. Remove the
fragile downcasts.
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit `pending`): `AddSiteEventLogging` now registers
`SiteEventLogger` as a concrete singleton and forwards `ISiteEventLogger` to that
same instance. `EventLogPurgeService` and `EventLogQueryService` take a
`SiteEventLogger` constructor parameter directly, eliminating the
`(SiteEventLogger)eventLogger` downcast and its `InvalidCastException` risk. All
three services still share one connection/lock. Regression tests
`AddSiteEventLogging_ResolvesAllServices_SharingOneRecorderInstance` and
`PurgeAndQueryServices_AcceptConcreteRecorder_WithoutDowncast`.
### SiteEventLogging-008 — Event-recording write failures are silently swallowed
@@ -336,7 +360,7 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:92-95` |
**Description**
@@ -359,7 +383,13 @@ a Warning/Error health metric rather than only a local log line.
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit `pending`): write failures are no longer swallowed. The
background writer (introduced for SiteEventLogging-005) now, on an `INSERT` failure,
(a) increments a new `Interlocked`-guarded counter exposed as the public
`SiteEventLogger.FailedWriteCount` property — which Health Monitoring can poll to
detect a logging outage — and (b) faults the `Task` returned by `LogEventAsync` with
the exception instead of returning `Task.CompletedTask`. The error is still logged.
Regression test `LogEventAsync_FaultsTask_AndCountsFailure_OnWriteError`.
### SiteEventLogging-009 — XML doc on `LogEventAsync` claims asynchronous behaviour
@@ -395,7 +425,7 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Status | Resolved |
| Location | `tests/ScadaLink.SiteEventLogging.Tests/` |
**Description**
@@ -423,7 +453,19 @@ oldest-first deletion, and a query-error-path test (e.g. corrupt/closed connecti
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit `pending`): the remaining coverage gaps are now filled.
`EventLogHandlerActorTests` (using `Akka.TestKit.Xunit2`) exercises the actor message
contract — `EventLogQueryRequest` -> `EventLogQueryResponse` via `Sender.Tell`, for
both success and error responses. `EventLogCoverageTests` covers the previously
untested `EventLogQueryService.ExecuteQuery` catch block
(`ExecuteQuery_ReturnsFailureResponse_WhenDatabaseUnavailable`) and the recorder's
disposed-state semantics (`LogEventAsync_AfterDispose_CompletesWithoutThrowing`,
`Dispose_IsIdempotent`). The purge/write concurrency, realistic-cap, and
oldest-first behaviours were already covered by the tests added when
SiteEventLogging-001/-002/-003 were resolved
(`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection`,
`PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable`,
`PurgeByStorageCap_RemovesOldestEventsFirst`).
### SiteEventLogging-011 — Stale "Phase 4+" placeholder in `ServiceCollectionExtensions`