819f1b4665
Each finding is a focused validation guard or upper bound at a trust boundary.
Highlights:
- Commons-015: EncryptionMetadata ctor now validates Algorithm (AES-256-GCM
only), Kdf (PBKDF2-SHA256 only), Iterations ([100k, 10M]), non-null Salt/IV.
- Transport-004: new BundleUnlockRateLimiter (sliding-window, per-key,
singleton) wired into BundleImporter.LoadAsync; over-budget callers see
BundleUnlockRateLimitedException. Per-bundle 3-strike + per-window cap.
- ESG-022: ExternalSystemClient.InvokeHttpAsync allow-lists the documented
GET/POST/PUT/PATCH/DELETE set (case-insensitive); unknown verbs throw.
- SEL-015: SiteEventLogger queue now bounded (10k cap, DropOldest); dropped
events fault their Task and increment FailedWriteCount so the drop is
observable instead of an unbounded memory growth.
- SEL-017: EventLogQueryService clamps caller-supplied PageSize to a new
MaxQueryPageSize cap (default 500) so int.MaxValue can't OOM the host.
- SEL-020: LogEventAsync rejects severities outside {Info, Warning, Error}
(matches SQLite BINARY-collation query filter).
- InboundAPI-020: ContentType "json" check now case-insensitive
(application/JSON no longer slips through as not-json).
- InboundAPI-024: _knownBadMethods capped at 1000 entries (drops new entries
once full); per-request DB lookup remains the correctness path.
- SR-025: HandleSetStaticAttribute validates the attribute name against the
deployed config; unknown names now return Success=false instead of
leaking orphan override rows into the SQLite store.
- TE-021: MoveTemplateAsync runs the sibling-name-collision check at the
destination, mirroring TemplateFolderService.MoveFolderAsync.
- TE-022: LockEnforcer's once-locked-stays-locked rule now also covers
LockedInDerived (was previously only IsLocked).
New regression tests across 8 test projects (EncryptionMetadata, rate
limiter, ESG client allow-list, SEL bounded channel / PageSize clamp /
severity validation, InboundAPI ContentType + bad-methods cap, SiteRT
unknown-attribute, TemplateEngine MoveTemplate + LockedInDerived).
Build clean; affected suites all green. README regenerated: 93 open (was 104).
Note: a separate manual re-run was needed for the SiteEventLogging hunk
because its initial subagent's source edits never landed on disk despite
reporting success (file-collision-style failure mode).
1135 lines
62 KiB
Markdown
1135 lines
62 KiB
Markdown
# Code Review — SiteEventLogging
|
|
|
|
| Field | Value |
|
|
|-------|-------|
|
|
| Module | `src/ScadaLink.SiteEventLogging` |
|
|
| Design doc | `docs/requirements/Component-SiteEventLogging.md` |
|
|
| Status | Reviewed |
|
|
| Last reviewed | 2026-05-28 |
|
|
| Reviewer | claude-agent |
|
|
| Commit reviewed | `1eb6e97` |
|
|
| Open findings | 5 |
|
|
|
|
## Summary
|
|
|
|
The SiteEventLogging module is small and broadly well-structured: a SQLite-backed
|
|
recorder (`SiteEventLogger`), a query service with keyset pagination, a background
|
|
purge service, and a thin Akka actor bridge. The query path is parameterised
|
|
correctly (no SQL injection) and reasonably well tested. However, the storage-cap
|
|
enforcement is functionally broken: `PRAGMA incremental_vacuum` is a no-op because
|
|
`auto_vacuum = INCREMENTAL` is never set, so the cap-purge loop never sees the
|
|
database shrink and over-deletes the entire table when triggered. There is also a
|
|
genuine concurrency hazard: the purge service and query service share the single
|
|
`SqliteConnection` owned by `SiteEventLogger` but bypass its `_writeLock`, so a purge
|
|
running on the background thread can collide with a write or a query on another
|
|
thread. The `LogEventAsync` API is synchronous despite its name and `Task` return,
|
|
which silently blocks Akka actor threads on disk I/O. Other findings concern the
|
|
cluster-singleton placement of the handler actor (which can pin to the standby
|
|
node), missing indexes for common query filters, retention/cap purge not enforcing
|
|
the requirement strictly, and several documentation/maintainability issues.
|
|
|
|
#### Re-review 2026-05-17 (commit `39d737e`)
|
|
|
|
Re-reviewed the module at commit `39d737e`. All eleven prior findings remain closed
|
|
(SiteEventLogging-001..003, 005..011 Resolved; 004 Won't Fix) and the resolutions
|
|
hold up under inspection — the background writer, lock-guarded `WithConnection`,
|
|
`auto_vacuum = INCREMENTAL` plus logical-size measurement, the severity index, and
|
|
the concrete-recorder DI wiring are all present and correct at this commit. The
|
|
module source is byte-identical between `39d737e` and current `HEAD`, so this review
|
|
reflects the live code. Three new findings were recorded, all low-to-medium and none
|
|
regressions of prior fixes. The most notable (SiteEventLogging-012) is a correctness
|
|
gap left by the SiteEventLogging-005 background-writer rework: when an event cannot
|
|
be persisted because the logger has been disposed, the returned `Task` is completed
|
|
*successfully* rather than faulted, so an `await`-ing caller is told a dropped audit
|
|
event was written. The other two are minor: unescaped SQL `LIKE` wildcards in the
|
|
keyword-search filter (SiteEventLogging-013) and a claimed initial-purge block on the
|
|
host startup thread (SiteEventLogging-014 — later re-triaged to Won't Fix, the
|
|
premise does not hold on .NET 8+).
|
|
|
|
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
|
|
|
Re-reviewed the module at commit `1eb6e97`. All fourteen prior findings remain closed
|
|
and their resolutions hold up under inspection: the lock-guarded `WithConnection`
|
|
overloads, the background-writer `Channel<T>` with disposed-mid-drain fault
|
|
propagation, the `auto_vacuum = INCREMENTAL` schema + logical-size measurement, the
|
|
severity index, the `LIKE` keyword-search escaping, and the concrete-recorder DI
|
|
wiring are all present and correct at this commit. Nine new findings were recorded —
|
|
none are regressions of prior fixes. The most notable (SiteEventLogging-016, **High**)
|
|
is a correctness defect in the query path: timestamps are stored as ISO 8601 strings
|
|
generated from `DateTimeOffset.UtcNow` (so they always have a `+00:00` offset suffix),
|
|
but the `From`/`To` filters are stringified verbatim via `request.From.Value.ToString("o")`
|
|
without normalising to UTC, so a central client that sends a non-UTC `DateTimeOffset`
|
|
gets a broken lexicographic comparison and either spuriously includes or excludes
|
|
events. The next-most-notable findings are SiteEventLogging-015 (unbounded background
|
|
write queue can grow without limit under sustained writer slowness — sister
|
|
`SqliteAuditWriter` uses a bounded channel) and SiteEventLogging-017 (the central
|
|
client's `PageSize` is used verbatim with no upper-bound clamp, defeating the design's
|
|
"prevents broad queries from overwhelming the communication channel" rationale). The
|
|
remaining findings are low-severity hygiene / documentation: an unused
|
|
`FailedWriteCount` metric, untyped severity/event-type fields, non-invariant culture
|
|
parsing, the purge service running on the standby node, the redundant `Cache=Shared`
|
|
on a single-connection logger, and a non-volatile stop flag in a concurrency stress
|
|
test.
|
|
|
|
## Checklist coverage
|
|
|
|
| # | Category | Examined | Notes |
|
|
|---|----------|----------|-------|
|
|
| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). Re-review: dropped events report success (-012); `LIKE` wildcards unescaped in keyword search (-013). |
|
|
| 2 | Akka.NET conventions | ☑ | Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). `Ask` boundary is appropriate. |
|
|
| 3 | Concurrency & thread safety | ☑ | Shared `SqliteConnection` used by purge/query without the write lock (-003). |
|
|
| 4 | Error handling & resilience | ☑ | `LogEventAsync` swallows write failures silently into a log line only (-008); purge catches broadly. |
|
|
| 5 | Security | ☑ | Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding. |
|
|
| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). Re-review: claimed initial-purge startup-thread block (-014 — Won't Fix, premise invalid on .NET 8+). |
|
|
| 7 | Design-document adherence | ☑ | Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002). |
|
|
| 8 | Code organization & conventions | ☑ | Concrete-type downcast of `ISiteEventLogger` (-007); `internal Connection` leaks DB handle (-007). |
|
|
| 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
|
|
| 10 | Documentation & comments | ☑ | `LogEventAsync` XML doc says "asynchronously" but is synchronous (-009); stale "Phase 4+" placeholder (-011). |
|
|
|
|
_Re-review (2026-05-28, `1eb6e97`):_
|
|
|
|
| # | Category | Examined | Notes |
|
|
|---|----------|----------|-------|
|
|
| 1 | Correctness & logic bugs | ☑ | `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps (-016); `DateTimeOffset.Parse` without invariant culture is culture-sensitive (-021); severity/event-type accept any non-empty string with no schema enforcement (-020). |
|
|
| 2 | Akka.NET conventions | ☑ | `EventLogHandlerActor` is a simple `Receive`/`Tell` bridge with no supervision concerns of its own; no new findings. |
|
|
| 3 | Concurrency & thread safety | ☑ | Concurrent-write stress test uses a non-volatile `stop` flag (-023). The shared-connection lock pattern is correct post-SiteEventLogging-003. |
|
|
| 4 | Error handling & resilience | ☑ | `FailedWriteCount` is exposed but nothing in Health Monitoring polls it — the metric is unobserved (-018). |
|
|
| 5 | Security | ☑ | Queries are fully parameterised. `PageSize` and `KeywordFilter` from the central client are not bounded (-017) — a hostile or buggy central could request `int.MaxValue` rows or multi-MB `LIKE` patterns. |
|
|
| 6 | Performance & resource management | ☑ | Background write queue is unbounded (-015); `Cache=Shared` is redundant for a single-connection logger (-022); upper-bound on `PageSize` missing (-017). |
|
|
| 7 | Design-document adherence | ☑ | `EventLogPurgeService` is registered as a per-host `BackgroundService` and runs on the standby too, but the design says "the daily background job runs on the active node" (-019). |
|
|
| 8 | Code organization & conventions | ☑ | `FailedWriteCount` is on the concrete `SiteEventLogger`, not on `ISiteEventLogger`, so any future non-concrete consumer cannot read it (-018). |
|
|
| 9 | Testing coverage | ☑ | Non-volatile `stop` flag in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` (-023). No tests for `PageSize` bounds, `From`/`To` timezone handling, or unobserved `FailedWriteCount`. |
|
|
| 10 | Documentation & comments | ☑ | `FailedWriteCount` XML doc claims "Health Monitoring can poll" but nothing does (-018). Severity / event-type docs enumerate values that are not enforced (-020). |
|
|
|
|
## Findings
|
|
|
|
### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | High |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:100-102`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:36-55` |
|
|
|
|
**Description**
|
|
|
|
`PurgeByStorageCap` issues `PRAGMA incremental_vacuum` after each delete batch to
|
|
reclaim space, then re-measures the database size via `page_count * page_size`.
|
|
`incremental_vacuum` only has any effect when the database was created with
|
|
`auto_vacuum = INCREMENTAL`. `InitializeSchema` never sets `auto_vacuum`, so the
|
|
database uses the SQLite default (`auto_vacuum = NONE`). With `NONE`,
|
|
`incremental_vacuum` is silently ignored and `page_count` does not decrease when
|
|
rows are deleted (free pages are retained in the file). Consequently the
|
|
`while (currentSizeBytes > capBytes)` loop never observes the size dropping. The
|
|
storage-cap feature required by the design ("configurable maximum database size...
|
|
oldest events are purged first") is therefore non-functional — it cannot bring the
|
|
file back under the cap.
|
|
|
|
**Recommendation**
|
|
|
|
Set `PRAGMA auto_vacuum = INCREMENTAL` in `InitializeSchema` before any tables are
|
|
created (it must be set before table creation or followed by a full `VACUUM` to take
|
|
effect on an existing database). Alternatively, run a full `VACUUM` after cap-purge
|
|
deletes, or measure logical data size (e.g. `page_count - freelist_count` times
|
|
`page_size`) instead of relying on `incremental_vacuum`.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `<pending>`): `InitializeSchema` now sets
|
|
`PRAGMA auto_vacuum = INCREMENTAL` before any table is created, and
|
|
`GetDatabaseSizeBytes` measures logical size as `(page_count - freelist_count) *
|
|
page_size` so reclaimed pages no longer mask the size drop. The cap-purge loop now
|
|
reliably observes the database shrinking. Regression test
|
|
`PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable`.
|
|
|
|
### SiteEventLogging-002 — Storage-cap purge deletes the entire table when space is not reclaimed
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | High |
|
|
| Category | Design-document adherence |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:87-105` |
|
|
|
|
**Description**
|
|
|
|
Because of SiteEventLogging-001 the on-disk size never shrinks after a delete batch,
|
|
so `currentSizeBytes` stays above `capBytes`. The loop then keeps deleting 1000-row
|
|
batches on every iteration until `ExecuteNonQuery` returns 0 — i.e. until the table
|
|
is completely empty. The design states the cap should purge "the oldest events...
|
|
first" to stay within budget, not wipe the whole log. When the cap is hit (e.g.
|
|
during an alarm storm) this destroys all retained diagnostic history rather than
|
|
trimming it to the budget. The unit test `PurgeByStorageCap_DeletesOldestWhenOverCap`
|
|
masks the problem because it uses `MaxStorageMb = 0`, which legitimately expects an
|
|
empty table, so the over-delete behaviour is never exercised against a realistic cap.
|
|
|
|
**Recommendation**
|
|
|
|
Fix the size measurement / vacuum (SiteEventLogging-001) so the loop terminates when
|
|
the file is genuinely under the cap. Add a guard so the loop stops once
|
|
`currentSizeBytes` has stopped decreasing across iterations, and add a test with a
|
|
non-zero cap and a known oversized dataset to assert that only the oldest events are
|
|
removed.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `<pending>`): with the size measurement fixed
|
|
(SiteEventLogging-001) the cap loop terminates when the file is genuinely under the
|
|
cap. An additional guard stops the loop if the on-disk size fails to decrease across
|
|
an iteration, so a cap that can never be met no longer empties the whole table.
|
|
Regression tests `PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable` (asserts the
|
|
table is not emptied and the file ends under a realistic non-zero cap) and
|
|
`PurgeByStorageCap_RemovesOldestEventsFirst` (asserts only the oldest events are
|
|
removed).
|
|
|
|
### SiteEventLogging-003 — Shared `SqliteConnection` used by purge and query without the write lock
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | High |
|
|
| Category | Concurrency & thread safety |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:64,90,100,110,114`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:36`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34,72` |
|
|
|
|
**Description**
|
|
|
|
`SiteEventLogger` owns a single `SqliteConnection` and serialises its own writes via
|
|
`lock (_writeLock)`. `EventLogPurgeService` and `EventLogQueryService` both reach
|
|
into `_eventLogger.Connection` and execute commands directly, without acquiring
|
|
`_writeLock`. The purge runs on a `BackgroundService` thread (a different thread from
|
|
event-recording callers and from the actor that drives the query service). A single
|
|
`SqliteConnection` / `SqliteCommand` is not thread-safe; concurrent use from the
|
|
purge thread and a recording thread (or query thread) can throw
|
|
`SqliteException`/`InvalidOperationException` ("DataReader already open",
|
|
"connection busy") or corrupt command state. The purge `DELETE` and the recorder
|
|
`INSERT` racing is the most likely collision because event recording is continuous.
|
|
|
|
**Recommendation**
|
|
|
|
Funnel all access to the connection through a single synchronisation point: either
|
|
expose lock-guarded methods on `SiteEventLogger` for purge/query to call, or give the
|
|
purge and query services their own dedicated `SqliteConnection` instances (SQLite
|
|
supports multiple connections to the same file; `Cache=Shared` plus a `busy_timeout`
|
|
makes this safer). Do not share one `SqliteConnection` across threads.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `<pending>`): the raw `internal Connection` property was
|
|
removed from `SiteEventLogger` and replaced with lock-guarded `WithConnection(...)`
|
|
overloads that hold the existing `_writeLock` for the duration of the caller's
|
|
delegate. `EventLogPurgeService`, `EventLogQueryService`, and `LogEventAsync` all now
|
|
access the connection exclusively through `WithConnection`, so the purge thread,
|
|
query thread, and recording threads are serialised on a single lock. `Dispose` was
|
|
also brought under the lock to avoid a dispose/use race. Regression test
|
|
`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` exercises purge running
|
|
concurrently with multiple writer threads.
|
|
|
|
### SiteEventLogging-004 — Event-log handler runs as a cluster singleton that can land on the standby node
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Original severity | High (re-triaged down to Low on 2026-05-16 — see Re-triage note) |
|
|
| Category | Design-document adherence |
|
|
| Status | Won't Fix |
|
|
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:313-336`, `src/ScadaLink.SiteEventLogging/EventLogHandlerActor.cs:21-25` |
|
|
|
|
**Description**
|
|
|
|
`EventLogHandlerActor` is hosted as a `ClusterSingletonManager` singleton with the
|
|
stated intent that "queries always reach the active node". However, an Akka.NET
|
|
cluster singleton is pinned to the *oldest* member of the role, which is not the
|
|
same concept as the SCADA "active node" (the node currently running the Deployment
|
|
Manager singleton / serving live traffic). The design doc is explicit: "Only the
|
|
active node generates and stores events... the new active node starts logging to its
|
|
own SQLite database." The event-log SQLite file is node-local and unreplicated.
|
|
Nothing guarantees the event-log singleton co-locates with the active node, so a
|
|
remote query can be served by the standby node and read that node's near-empty
|
|
database, returning no events even though the active node has a full log. The
|
|
explanatory comment in `AkkaHostedService.cs` asserts the opposite of what actually
|
|
happens.
|
|
|
|
**Recommendation**
|
|
|
|
Either (a) host the query handler as a normal per-node actor and route queries to
|
|
the active node explicitly (the node owning the Deployment Manager singleton), or
|
|
(b) make the event-log writer follow the same singleton so the writer and the query
|
|
handler are guaranteed co-located. Reconcile the design doc and the inline comment
|
|
with whichever model is chosen.
|
|
|
|
**Re-triage note (2026-05-16)**
|
|
|
|
The finding's central claim — that a remote query "can be served by the standby
|
|
node and read that node's near-empty database" — is incorrect for the query path.
|
|
In `AkkaHostedService.cs` the `event-log-handler` `ClusterSingletonManager` and the
|
|
`deployment-manager` `ClusterSingletonManager` are created with the **same role**
|
|
(`siteRole`) in the **same cluster**. Akka.NET pins every cluster singleton of a
|
|
given role to the *oldest member of that role* — so all same-role singletons in a
|
|
cluster co-locate on one node. The "active node" in this codebase is, by definition,
|
|
the node hosting the `deployment-manager` singleton; the event-log query singleton
|
|
is therefore *guaranteed* to run on that same node. A `ClusterClient` query cannot
|
|
land on the standby. The inline comment in `AkkaHostedService.cs` is accurate, not
|
|
"the opposite of what happens".
|
|
|
|
A real but distinct concern exists: the *writer* (`SiteEventLogger`) is registered
|
|
as a plain per-node DI singleton (`AddSiteEventLogging`), so it records to a local
|
|
SQLite file on **every** node, including the standby. That wastes storage on the
|
|
standby but does **not** cause the query-returns-nothing symptom the finding
|
|
describes, because the query singleton always reads the *active* node's (populated)
|
|
database. Gating the writer to the active node would be a `ScadaLink.Host` wiring
|
|
change, outside this module's scope, and is a minor optimisation rather than a
|
|
correctness defect.
|
|
|
|
Re-triaged from High to Low and closed as **Won't Fix**: the High-severity
|
|
correctness claim does not hold. Any residual cleanup (gate the standby-node writer;
|
|
the comment needs no change) can be raised as a fresh Low finding against
|
|
`ScadaLink.Host` if desired.
|
|
|
|
**Resolution**
|
|
|
|
Won't Fix — 2026-05-16 (commit `<pending>`). Re-triaged: the asserted defect (query
|
|
served by standby returning an empty log) cannot occur because the event-log query
|
|
singleton and the deployment-manager singleton share a role and so always co-locate
|
|
on the active node. No code change made; see the re-triage note above.
|
|
|
|
### SiteEventLogging-005 — `LogEventAsync` performs synchronous disk I/O on the caller's thread
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Performance & resource management |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57-99` |
|
|
|
|
**Description**
|
|
|
|
`LogEventAsync` is declared `async`-shaped (returns `Task`, `Async` suffix) but its
|
|
body is entirely synchronous: it takes `lock (_writeLock)`, runs
|
|
`cmd.ExecuteNonQuery()` (a blocking SQLite write), then returns `Task.CompletedTask`.
|
|
Callers across the codebase invoke it fire-and-forget as `_ = LogEventAsync(...)`
|
|
(e.g. `ScriptExecutionActor.cs:133`, `DataConnectionActor.cs:292`,
|
|
`ScriptActor.cs:250`) expecting it to be non-blocking. In reality the SQLite write,
|
|
and any contention on `_writeLock`, executes inline on the Akka actor thread of the
|
|
calling subsystem. Under an event burst (alarm storm, script failure loop) this
|
|
serialises actor threads on disk I/O and the global write lock, degrading the
|
|
hot-path subsystems the design intends to keep responsive.
|
|
|
|
**Recommendation**
|
|
|
|
Either make recording genuinely asynchronous (offload to a dedicated single-threaded
|
|
writer / `Channel<T>` consumer so callers truly fire-and-forget), or rename the
|
|
method to `LogEvent` and document that it blocks, so callers can decide. Given the
|
|
design's emphasis on not impacting runtime subsystems, an internal queue with a
|
|
background flush is preferable.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `pending`): event recording is now offloaded to a
|
|
dedicated background writer. `SiteEventLogger` owns an unbounded `Channel<T>` and a
|
|
single background consumer thread; `LogEventAsync` only validates its arguments and
|
|
enqueues, so caller threads (Akka actor threads on hot paths) never block on the
|
|
SQLite write or on contention for the write lock. The returned `Task` completes once
|
|
the event is durably persisted (so `await` callers still observe write ordering) and
|
|
faults if the write fails. `Dispose` completes the channel and drains the writer.
|
|
Regression test `LogEventAsync_DoesNotBlockCaller_WhenWriteIsSlow` (verifies the
|
|
caller returns in <500 ms while the database is held busy) plus
|
|
`LogEventAsync_TaskCompletes_AfterEventIsPersisted`.
|
|
|
|
### SiteEventLogging-006 — Missing indexes for severity and keyword-search query paths
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Performance & resource management |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:50-52`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:65-81` |
|
|
|
|
**Description**
|
|
|
|
`InitializeSchema` creates indexes on `timestamp`, `event_type`, and `instance_id`.
|
|
The query service also filters on `severity` (`severity = $severity`) and performs
|
|
`message LIKE '%...%'` / `source LIKE '%...%'` keyword search. `severity` has no
|
|
index, and a leading-wildcard `LIKE` cannot use a normal index at all. With up to a
|
|
1 GB database and a 500-row page size, severity-filtered and keyword queries do full
|
|
table scans on every page. The design explicitly lists keyword search as a supported,
|
|
expected query type.
|
|
|
|
**Recommendation**
|
|
|
|
Add an index on `severity` (or a composite index aligned with common filter
|
|
combinations such as `(event_type, severity, id)`). For keyword search, consider an
|
|
FTS5 virtual table over `message` and `source`, or accept the scan but document the
|
|
cost.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `pending`): `InitializeSchema` now creates
|
|
`idx_events_severity` on `site_events(severity)`, so severity-filtered queries use an
|
|
index instead of full-scanning. The leading-wildcard `LIKE` keyword search cannot use
|
|
a B-tree index by construction; rather than adding an FTS5 table, that scan is
|
|
accepted and now documented with an inline comment next to the index DDL. Regression
|
|
tests `Schema_HasIndexOnSeverity` and `SeverityFilteredQuery_UsesIndex_NotFullScan`
|
|
(the latter asserts the SQLite query plan uses `idx_events_severity` and does not
|
|
`SCAN site_events`).
|
|
|
|
### SiteEventLogging-007 — `ISiteEventLogger` consumers downcast to the concrete type
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium (partially re-triaged 2026-05-16 — see Re-triage note) |
|
|
| Category | Code organization & conventions |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:21-30`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:20-28`, `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:10-23` |
|
|
|
|
**Description**
|
|
|
|
Both `EventLogPurgeService` and `EventLogQueryService` took `ISiteEventLogger` via
|
|
DI and immediately downcast it: `_eventLogger = (SiteEventLogger)eventLogger;`. This
|
|
made the registration fragile — any `ISiteEventLogger` that is not exactly
|
|
`SiteEventLogger` (a test double, a decorator) caused an `InvalidCastException` at
|
|
construction — and defeated the purpose of the interface abstraction.
|
|
|
|
**Re-triage note (2026-05-16)**
|
|
|
|
The finding as originally written also claimed the services "access the `internal
|
|
SqliteConnection Connection` property to run arbitrary SQL" and called itself "the
|
|
root cause of the unsynchronised connection sharing in SiteEventLogging-003". That
|
|
part is stale: the resolution of SiteEventLogging-003 had already removed the
|
|
`internal Connection` property and replaced it with lock-guarded `WithConnection`
|
|
overloads. At the time this finding was actioned, the only remaining defect was the
|
|
concrete-type downcast itself. Severity stays Medium; the description is corrected to
|
|
the downcast-only scope.
|
|
|
|
**Recommendation**
|
|
|
|
Have the purge and query services depend on the concrete `SiteEventLogger` directly
|
|
(it is the type that owns the lock-guarded `WithConnection`), and register the
|
|
concrete type in DI with the interface forwarded to the same singleton. Remove the
|
|
fragile downcasts.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `pending`): `AddSiteEventLogging` now registers
|
|
`SiteEventLogger` as a concrete singleton and forwards `ISiteEventLogger` to that
|
|
same instance. `EventLogPurgeService` and `EventLogQueryService` take a
|
|
`SiteEventLogger` constructor parameter directly, eliminating the
|
|
`(SiteEventLogger)eventLogger` downcast and its `InvalidCastException` risk. All
|
|
three services still share one connection/lock. Regression tests
|
|
`AddSiteEventLogging_ResolvesAllServices_SharingOneRecorderInstance` and
|
|
`PurgeAndQueryServices_AcceptConcreteRecorder_WithoutDowncast`.
|
|
|
|
### SiteEventLogging-008 — Event-recording write failures are silently swallowed
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Error handling & resilience |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:92-95` |
|
|
|
|
**Description**
|
|
|
|
If `ExecuteNonQuery` throws (disk full, database locked, file corruption), the
|
|
exception is caught, written to `ILogger`, and discarded; `LogEventAsync` still
|
|
returns `Task.CompletedTask` as if successful. Callers fire-and-forget the result so
|
|
they cannot detect failure. The event log is the site's diagnostic audit trail; a
|
|
sustained write failure (for example a locked-database storm caused by the
|
|
unsynchronised purge in SiteEventLogging-003) means events vanish with no signal to
|
|
operators except a local log line that nobody is watching. There is no failure
|
|
counter, no health-metric hook, and no retry.
|
|
|
|
**Recommendation**
|
|
|
|
Expose a failure signal: increment a counter that the Health Monitoring component
|
|
can surface (the design notes script/alarm error rates are derived from the event
|
|
log — a logging outage should be visible). At minimum, escalate repeated failures to
|
|
a Warning/Error health metric rather than only a local log line.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `pending`): write failures are no longer swallowed. The
|
|
background writer (introduced for SiteEventLogging-005) now, on an `INSERT` failure,
|
|
(a) increments a new `Interlocked`-guarded counter exposed as the public
|
|
`SiteEventLogger.FailedWriteCount` property — which Health Monitoring can poll to
|
|
detect a logging outage — and (b) faults the `Task` returned by `LogEventAsync` with
|
|
the exception instead of returning `Task.CompletedTask`. The error is still logged.
|
|
Regression test `LogEventAsync_FaultsTask_AndCountsFailure_OnWriteError`.
|
|
|
|
### SiteEventLogging-009 — XML doc on `LogEventAsync` claims asynchronous behaviour
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Documentation & comments |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:8-10`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57` |
|
|
|
|
**Description**
|
|
|
|
The interface XML doc states "Record an event asynchronously." and the method is
|
|
named `LogEventAsync`, but the implementation is fully synchronous (see
|
|
SiteEventLogging-005). The documentation and naming are misleading: a reader will
|
|
reasonably assume the write is offloaded and the caller's thread is not blocked,
|
|
which is false. The `details` parameter doc says "Optional JSON details" but nothing
|
|
validates or requires JSON, so callers may pass arbitrary text.
|
|
|
|
**Re-triage note (2026-05-16)**
|
|
|
|
The finding's central premise is now stale. SiteEventLogging-005 was resolved by
|
|
introducing a dedicated background writer: `LogEventAsync` enqueues onto an unbounded
|
|
`Channel<T>` and returns without blocking, so the method is now *genuinely*
|
|
asynchronous and the name `LogEventAsync` is accurate. No rename is warranted. The
|
|
only residual documentation defect was the imprecise XML doc (the one-line summary
|
|
did not describe the enqueue/await semantics) and the `details` doc claiming "JSON"
|
|
when nothing enforces it. Scope corrected to those doc fixes only.
|
|
|
|
**Recommendation**
|
|
|
|
Align the name, signature, and documentation with the actual behaviour — either make
|
|
the method genuinely asynchronous or rename to `LogEvent` and correct the doc.
|
|
Clarify that `details` is free-form text unless JSON is actually enforced.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `pending`): the `ISiteEventLogger.LogEventAsync` XML doc
|
|
now describes the actual behaviour — the call enqueues onto a background writer and
|
|
returns without blocking, and the returned `Task` completes on durable persistence
|
|
(or faults on write failure). The `details` parameter doc was corrected to "Optional
|
|
free-form detail text ... JSON is conventional but not validated or enforced". No
|
|
rename was needed: post-SiteEventLogging-005 the method is genuinely asynchronous, so
|
|
`LogEventAsync` is accurate (see Re-triage note). Documentation-only change; no
|
|
regression test added (behaviour is already covered by the SiteEventLogging-005
|
|
tests `LogEventAsync_DoesNotBlockCaller_WhenWriteIsSlow` and
|
|
`LogEventAsync_TaskCompletes_AfterEventIsPersisted`).
|
|
|
|
### SiteEventLogging-010 — Test coverage gaps: actor bridge, purge/write concurrency, vacuum effectiveness, query error path
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Testing coverage |
|
|
| Status | Resolved |
|
|
| Location | `tests/ScadaLink.SiteEventLogging.Tests/` |
|
|
|
|
**Description**
|
|
|
|
The test suite covers recording, query filtering/pagination, and basic purge, but
|
|
several critical behaviours are untested:
|
|
|
|
- `EventLogHandlerActor` has no test — the actor message contract
|
|
(`EventLogQueryRequest` -> `EventLogQueryResponse`, `Sender.Tell`) is unverified.
|
|
- No test exercises purge running concurrently with active writes/queries, so the
|
|
connection-sharing race (SiteEventLogging-003) is invisible to CI.
|
|
- `PurgeByStorageCap` is only tested with `MaxStorageMb = 0`, which hides the
|
|
no-op-vacuum / over-delete bug (SiteEventLogging-001, -002). No test asserts the
|
|
file shrinks or that only oldest events are removed under a realistic cap.
|
|
- `EventLogQueryService.ExecuteQuery`'s catch block (`Success: false`,
|
|
`ErrorMessage`) has no test.
|
|
- `SiteEventLogger.Dispose` semantics (logging after dispose returns
|
|
`Task.CompletedTask`) and re-entrant dispose are untested.
|
|
|
|
**Recommendation**
|
|
|
|
Add tests for the actor bridge, a concurrency stress test (purge + write + query in
|
|
parallel), a realistic non-zero-cap purge test asserting size reduction and
|
|
oldest-first deletion, and a query-error-path test (e.g. corrupt/closed connection).
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `pending`): the remaining coverage gaps are now filled.
|
|
`EventLogHandlerActorTests` (using `Akka.TestKit.Xunit2`) exercises the actor message
|
|
contract — `EventLogQueryRequest` -> `EventLogQueryResponse` via `Sender.Tell`, for
|
|
both success and error responses. `EventLogCoverageTests` covers the previously
|
|
untested `EventLogQueryService.ExecuteQuery` catch block
|
|
(`ExecuteQuery_ReturnsFailureResponse_WhenDatabaseUnavailable`) and the recorder's
|
|
disposed-state semantics (`LogEventAsync_AfterDispose_CompletesWithoutThrowing`,
|
|
`Dispose_IsIdempotent`). The purge/write concurrency, realistic-cap, and
|
|
oldest-first behaviours were already covered by the tests added when
|
|
SiteEventLogging-001/-002/-003 were resolved
|
|
(`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection`,
|
|
`PurgeByStorageCap_StopsWhenUnderCap_DoesNotEmptyTable`,
|
|
`PurgeByStorageCap_RemovesOldestEventsFirst`).
|
|
|
|
### SiteEventLogging-011 — Stale "Phase 4+" placeholder in `ServiceCollectionExtensions`
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Documentation & comments |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:18-22` |
|
|
|
|
**Description**
|
|
|
|
`AddSiteEventLoggingActors` is an empty method with a comment "Placeholder for Akka
|
|
actor registration (Phase 4+)". The actor (`EventLogHandlerActor`) is in fact already
|
|
implemented and is registered directly in `AkkaHostedService.cs:313-336`, not through
|
|
this method. The placeholder is dead code: it is either never called or called with
|
|
no effect, and the comment is stale. A reader looking for where the event-log actor
|
|
is wired up will be misdirected.
|
|
|
|
**Recommendation**
|
|
|
|
Either implement the actor registration here and have `AkkaHostedService` call it
|
|
(centralising the wiring), or delete `AddSiteEventLoggingActors` entirely and remove
|
|
the misleading comment.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `pending`): confirmed dead code — a repo-wide search
|
|
found zero callers of `AddSiteEventLoggingActors`, and `EventLogHandlerActor` is in
|
|
fact wired up directly in `ScadaLink.Host/Actors/AkkaHostedService.cs` as a cluster
|
|
singleton (it must be created inside the `ActorSystem` with a resolved
|
|
`IEventLogQueryService`, which a `IServiceCollection` extension cannot do). The empty
|
|
placeholder method and its stale "Phase 4+" comment were deleted, and a short
|
|
explanatory note added to `AddSiteEventLogging` pointing readers to where the actor
|
|
is actually registered. Documentation/dead-code change only; no regression test was
|
|
added — the change is a method removal verified by the compiler (no callers) and the
|
|
full module suite still passing.
|
|
|
|
### SiteEventLogging-012 — Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:160-166,193-197` |
|
|
|
|
**Description**
|
|
|
|
`LogEventAsync` returns a `Task` that, per the interface XML doc (corrected under
|
|
SiteEventLogging-009), "completes once the event is durably persisted and faults if
|
|
the write fails, so callers that `await` it observe success or failure." Two paths
|
|
break that contract by signalling **success** for an event that was never written:
|
|
|
|
1. In `LogEventAsync`, if `_writeQueue.Writer.TryWrite(pending)` fails (the channel
|
|
has been completed because the logger was disposed), the code calls
|
|
`pending.Completion.TrySetResult()` — completing the `Task` successfully — even
|
|
though the comment immediately above acknowledges "there is nowhere to persist the
|
|
event."
|
|
2. In `ProcessWriteQueueAsync`, `WithConnection` returns `false` when the logger has
|
|
been disposed mid-drain. The code does not inspect the returned `written` flag and
|
|
unconditionally calls `pending.Completion.TrySetResult()`, again reporting success
|
|
for an event the comment admits "simply cannot be persisted."
|
|
|
|
The event log is the site's diagnostic audit trail. A caller that `await`s
|
|
`LogEventAsync` to confirm a critical event (deployment applied, alarm activated) was
|
|
recorded will observe a *successful* completion for an event that was silently
|
|
dropped. This is the same class of defect SiteEventLogging-008 fixed for write
|
|
*errors* — but the disposed-drop path was left reporting false success. The window
|
|
is the disposal/shutdown interval, during which shutdown-related events (graceful
|
|
singleton handover, instance disable) are exactly the events most likely to be
|
|
enqueued and lost.
|
|
|
|
**Recommendation**
|
|
|
|
For both paths, fault the `Task` (or complete it with a sentinel failure) instead of
|
|
`TrySetResult()` — e.g. `pending.Completion.TrySetException(new ObjectDisposedException(...))`
|
|
— so an `await`-ing caller can distinguish a dropped event from a persisted one.
|
|
Inspect the `written` flag returned by `WithConnection` in `ProcessWriteQueueAsync`
|
|
and only call `TrySetResult()` when `written` is `true`. Update the XML doc if a
|
|
deliberate "drop silently on shutdown" semantics is chosen instead.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-17 (commit `<pending>`): both disposed-drop paths now fault the
|
|
returned `Task` with `ObjectDisposedException` instead of `TrySetResult()` —
|
|
`LogEventAsync` on a failed `TryWrite`, and `ProcessWriteQueueAsync` now inspects the
|
|
`written` flag from `WithConnection` and only calls `TrySetResult()` when the row was
|
|
actually persisted. An `await`-ing caller can now distinguish a dropped audit event
|
|
from a persisted one. Regression tests
|
|
`LogEventAsync_AfterDispose_FaultsTask_NotReportsSuccess` and
|
|
`LogEventAsync_EnqueuedThenDisposed_FaultsTask_WhenWriteCannotComplete` (the prior
|
|
`LogEventAsync_AfterDispose_CompletesWithoutThrowing` test asserted the now-incorrect
|
|
silent-success behaviour and was replaced).
|
|
|
|
### SiteEventLogging-013 — Keyword search does not escape SQL `LIKE` wildcards in user input
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:79-83` |
|
|
|
|
**Description**
|
|
|
|
The keyword-search filter builds the `LIKE` pattern as `$"%{request.KeywordFilter}%"`
|
|
and binds it as a parameter. Parameterisation correctly prevents SQL injection, but
|
|
it does **not** neutralise the `LIKE` metacharacters `%` and `_` inside the
|
|
user-supplied keyword. A search for a literal `_` (common in event sources and
|
|
identifiers such as `store_and_forward`, `PLC_1`, or instance IDs) is interpreted as
|
|
"match any single character", and a `%` matches any run of characters. The design
|
|
calls keyword search "free-text search on message and source fields ... Useful for
|
|
finding events by script name, alarm name, or error message" — users will reasonably
|
|
expect a literal substring match, so a query for `store_and_forward` silently returns
|
|
events containing `storeXandYforward` and similar false positives. There is no way
|
|
for the caller to search for a literal underscore or percent.
|
|
|
|
**Recommendation**
|
|
|
|
Escape `%`, `_`, and the escape character itself in `request.KeywordFilter` before
|
|
wrapping it in `%...%`, and append an `ESCAPE` clause to the `LIKE` expression
|
|
(e.g. `... LIKE $keyword ESCAPE '\'`). Alternatively document that the keyword field
|
|
accepts `LIKE` wildcard syntax, but a literal-substring match is the behaviour the
|
|
design implies.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-17 (commit `<pending>`): `EventLogQueryService` now escapes `\`,
|
|
`%`, and `_` in `request.KeywordFilter` via a new `EscapeLikePattern` helper before
|
|
wrapping it in `%...%`, and the `LIKE` clauses carry an explicit `ESCAPE '\'` so the
|
|
keyword is matched as a literal substring as the design intends. Regression tests
|
|
`Query_KeywordSearch_TreatsUnderscoreAsLiteral_NotWildcard`,
|
|
`Query_KeywordSearch_TreatsPercentAsLiteral_NotWildcard`, and
|
|
`Query_KeywordSearch_StillMatchesPlainSubstring`.
|
|
|
|
### SiteEventLogging-014 — Initial purge runs synchronously on the host startup thread
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Performance & resource management |
|
|
| Status | Won't Fix |
|
|
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:34-48` |
|
|
|
|
**Description**
|
|
|
|
`EventLogPurgeService.ExecuteAsync` calls `RunPurge()` (a fully synchronous method
|
|
that runs `PurgeByRetention` and `PurgeByStorageCap`) *before* the first `await`
|
|
(`await timer.WaitForNextTickAsync(...)`). A `BackgroundService`'s `ExecuteAsync` is
|
|
invoked from `StartAsync`, and the host's startup pipeline does not proceed past a
|
|
`BackgroundService` until its `ExecuteAsync` yields at the first real `await`. Because
|
|
`RunPurge()` precedes any `await`, the entire initial purge — including a cap-purge
|
|
that deletes rows in 1000-row batches and runs `PRAGMA incremental_vacuum` until a
|
|
near-1 GB database is back under the cap — executes inline on the startup thread,
|
|
blocking host startup (and therefore the `/health/ready` gate) for as long as the
|
|
purge takes. On a site that has accumulated a large log this can be a multi-second
|
|
stall during every node start/failover. The class doc states the service "runs on a
|
|
background thread and does not block event recording" — the startup-thread block is
|
|
inconsistent with that intent.
|
|
|
|
**Recommendation**
|
|
|
|
Yield before the initial purge so it runs on the background scheduler rather than the
|
|
startup thread — e.g. `await Task.Yield();` as the first statement of `ExecuteAsync`,
|
|
or move the initial `RunPurge()` to after the first `await timer.WaitForNextTickAsync`
|
|
(accepting a one-interval delay), or offload it with `await Task.Run(RunPurge, stoppingToken)`.
|
|
|
|
**Re-triage note (2026-05-17)**
|
|
|
|
The finding's central premise — "the host's startup pipeline does not proceed past a
|
|
`BackgroundService` until its `ExecuteAsync` yields at the first real `await`", so the
|
|
synchronous `RunPurge()` prelude blocks the startup thread — is **incorrect for the
|
|
.NET version this project targets**. The module targets `net10.0`. Since .NET 8,
|
|
`BackgroundService.StartAsync` no longer runs `ExecuteAsync` inline on the calling
|
|
(host startup) thread: `ExecuteAsync` is dispatched onto the thread pool, and
|
|
`StartAsync` returns immediately regardless of how long the synchronous prelude runs.
|
|
|
|
This was verified empirically. A standalone `BackgroundService` whose `ExecuteAsync`
|
|
prelude does a 1.5 s synchronous `Thread.Sleep` before its first `await` showed
|
|
`StartAsync` returning in **0 ms**, with the prelude running on a *different* thread
|
|
pool thread than the caller. Applied to `EventLogPurgeService`, `StartAsync` returns
|
|
promptly and the initial `RunPurge()` executes on the background scheduler — host
|
|
startup and the `/health/ready` gate are not blocked. There is therefore no defect
|
|
to fix.
|
|
|
|
**Resolution**
|
|
|
|
Won't Fix — 2026-05-17 (commit `<pending>`). Re-triaged: the asserted startup-thread
|
|
block cannot occur on .NET 8+ (this module targets `net10.0`), where
|
|
`BackgroundService` dispatches `ExecuteAsync` to the thread pool rather than running
|
|
its synchronous prelude on the host startup thread — verified empirically (see the
|
|
re-triage note). No code change made. A verification test
|
|
`StartAsync_DoesNotBlock_OnTheInitialPurge` was added to pin this behaviour
|
|
(asserts `StartAsync` returns in under 1 s and the initial purge still runs on the
|
|
background scheduler).
|
|
|
|
### SiteEventLogging-015 — Background write queue is unbounded; can grow without limit under sustained writer slowness
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Performance & resource management |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:58-63` |
|
|
|
|
**Resolution (2026-05-28):** Switched `SiteEventLogger._writeQueue` to
|
|
`Channel.CreateBounded<PendingEvent>` with `FullMode = BoundedChannelFullMode.DropOldest`.
|
|
Default capacity 10,000 via new `SiteEventLogOptions.WriteQueueCapacity`. The
|
|
`itemDropped` callback faults the dropped event's Task with
|
|
`InvalidOperationException` and increments `FailedWriteCount` so both awaiting
|
|
callers and Health Monitoring observe drops. `DropOldest` preserves the
|
|
SiteEventLogging-005 "callers never block" guarantee. Test:
|
|
`LogEventAsync_BoundedQueueDropsOldest_AndFaultsDroppedTask`.
|
|
|
|
**Description**
|
|
|
|
`SiteEventLogger` creates its background-writer feeder as
|
|
`Channel.CreateUnbounded<PendingEvent>(...)`. The writer thread funnels every write
|
|
through the shared `_writeLock` (acquired by `WithConnection`), so any condition that
|
|
makes a single iteration slow — a long-running query in `EventLogQueryService`
|
|
holding the lock, a `PurgeByStorageCap` run that takes the lock for batched
|
|
`DELETE` + `PRAGMA incremental_vacuum`, a disk stall, or a sustained event burst
|
|
from an alarm storm / script failure loop — drives the queue arbitrarily large.
|
|
Every queued `PendingEvent` retains its `TaskCompletionSource` and its payload
|
|
strings, so there is no upper bound on how much memory the recorder can hold.
|
|
|
|
The sister centralized-audit component `ScadaLink.AuditLog/Site/SqliteAuditWriter.cs`
|
|
addresses the same hot-path-writer problem with
|
|
`Channel.CreateBounded<...>(new BoundedChannelOptions(_options.ChannelCapacity) { ..., FullMode = BoundedChannelFullMode.Wait })`,
|
|
giving back-pressure to producers. Site event logging picked the riskier choice for
|
|
a component that — per the design — is fed by every site subsystem (script, alarm,
|
|
deployment, DCL, store-and-forward, instance lifecycle, notification) and has both
|
|
a 30-day retention sweep and a 1 GB cap-purge competing for the same lock.
|
|
|
|
**Recommendation**
|
|
|
|
Switch to `Channel.CreateBounded<PendingEvent>(...)` with a configurable capacity
|
|
(default in the order of 10 000 — large enough to absorb a normal alarm burst,
|
|
small enough to bound memory). Pick a `FullMode` that matches policy: `Wait` for
|
|
back-pressure (callers `await` and serialise their actor thread on the queue —
|
|
defeats some of the SiteEventLogging-005 win but is safe), or `DropOldest` /
|
|
`DropWrite` with a counter (drop-and-count is closer to "best-effort audit"). Add
|
|
the dropped-event counter to `FailedWriteCount` or a sibling metric. Document the
|
|
chosen policy on `ISiteEventLogger.LogEventAsync`.
|
|
|
|
### SiteEventLogging-016 — `From`/`To` filters compare non-normalised ISO 8601 strings against UTC-stored timestamps
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | High |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:67-77`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:159`, `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:72-78` |
|
|
|
|
**Resolution** — `EventLogQueryService.ExecuteQuery` now calls `.ToUniversalTime()` on `request.From`/`request.To` before `ToString("o")`, so the produced ISO 8601 string always ends in `+00:00` and lexicographically matches the UTC timestamps written by `SiteEventLogger`. `EventLogPurgeService.PurgeByRetention` was also made defensive with an explicit `.ToUniversalTime()` on the cutoff. A regression test (`Query_FiltersByTimeRange_HandlesNonUtcOffset`) constructs a `+05:00` `DateTimeOffset` and asserts the matching UTC-stored events are returned and out-of-range ones are excluded.
|
|
|
|
**Description**
|
|
|
|
Event rows are persisted with `timestamp` = `DateTimeOffset.UtcNow.ToString("o")`,
|
|
which always emits the round-trip ISO 8601 form ending in the literal offset
|
|
`+00:00` (e.g. `2026-05-28T12:34:56.7890123+00:00`). The query path filters by
|
|
range using a direct string comparison:
|
|
|
|
```
|
|
whereClauses.Add("timestamp >= $from");
|
|
parameters.Add(new SqliteParameter("$from", request.From.Value.ToString("o")));
|
|
```
|
|
|
|
`request.From` is a `DateTimeOffset?` and `ToString("o")` preserves whatever offset
|
|
the caller passed in. If a central client passes a non-UTC `DateTimeOffset` — for
|
|
example the result of `DateTimeOffset.Now` in a `UTC+05:00` timezone — the produced
|
|
string is `"2026-05-28T17:34:56.0000000+05:00"`, which is lexicographically *greater*
|
|
than the equivalent UTC instant string `"2026-05-28T12:34:56.0000000+00:00"`. The
|
|
comparison `timestamp >= $from` is then evaluated as a byte-by-byte string compare
|
|
(SQLite default `BINARY` collation), so the query either spuriously excludes events
|
|
that genuinely occurred in the range, or spuriously includes events from a wholly
|
|
different hour. The same defect applies to `To`. The retention purge does
|
|
`DateTimeOffset.UtcNow.AddDays(-N).ToString("o")` (UTC) so it is safe; only the
|
|
central query path is vulnerable.
|
|
|
|
The design explicitly states "All timestamps are UTC throughout the system" but the
|
|
boundary between a central `DateTimeOffset` and the SQLite store is not enforced.
|
|
A central UI rendered in a non-UTC timezone is the most likely trigger, and the
|
|
defect silently corrupts every query that filters by time range — exactly the
|
|
filter most likely to be set on a "show me what happened around the failover" query.
|
|
|
|
**Recommendation**
|
|
|
|
Normalise `From` / `To` to UTC before serialising:
|
|
`request.From.Value.ToUniversalTime().ToString("o")` (or
|
|
`.UtcDateTime.ToString("o")`), so the produced offset is always `+00:00`. Add a
|
|
regression test that filters with a `DateTimeOffset` carrying a non-zero offset and
|
|
asserts the matching events are returned. Optionally also store timestamps as
|
|
Unix-epoch `INTEGER` and let SQLite compare numerically, eliminating the
|
|
lexicographic-comparison hazard structurally.
|
|
|
|
### SiteEventLogging-017 — Central client's `PageSize` is unbounded; defeats the "configurable page size" design rationale
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Security |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:55`, `src/ScadaLink.Commons/Messages/RemoteQuery/EventLogQueryRequest.cs:18` |
|
|
|
|
**Resolution (2026-05-28):** `EventLogQueryService.ExecuteQuery` now clamps
|
|
`pageSize` to new `SiteEventLogOptions.MaxQueryPageSize` (default 500, matching
|
|
existing `QueryPageSize` default) before issuing the SQL. Silent clamp rather
|
|
than reject so misconfigured clients still get a usable response. Test:
|
|
`Query_PageSize_IsClampedToMaxQueryPageSize` (asserts a request of 100,000 is
|
|
clamped down with `HasMore = true`).
|
|
|
|
**Description**
|
|
|
|
`EventLogQueryService.ExecuteQuery` resolves the effective page size as
|
|
`var pageSize = request.PageSize > 0 ? request.PageSize : _options.QueryPageSize;`
|
|
and uses it directly as the SQL `LIMIT $limit` (passing `pageSize + 1` to detect
|
|
"has more"). There is no upper bound. A central client — buggy or hostile — can
|
|
send `PageSize = int.MaxValue`, in which case the query attempts to materialise the
|
|
entire (up to 1 GB) event log into a single `List<EventLogEntry>` while holding the
|
|
shared write lock. This:
|
|
|
|
- Builds a worst-case ~1 GB managed allocation that, depending on Akka.NET cluster
|
|
message serialisation limits, will then be serialised into an
|
|
`EventLogQueryResponse` and pushed over the ClusterClient pipe.
|
|
- Blocks all writes (purge, recorder hot path) for the duration of the scan
|
|
because the read holds `_writeLock`.
|
|
- Stalls the singleton `EventLogHandlerActor`, also blocking subsequent legitimate
|
|
queries.
|
|
|
|
The design explicitly justifies pagination as preventing exactly this — "Results
|
|
are paginated with a configurable page size (default: 500 events) ... This prevents
|
|
broad queries from overwhelming the communication channel." The code honours the
|
|
*default* but does not enforce an *upper bound* on a client-supplied override.
|
|
|
|
**Recommendation**
|
|
|
|
Clamp `pageSize` to a configurable maximum (e.g. `SiteEventLogOptions.MaxQueryPageSize`,
|
|
default 5000) before using it. Also bound `KeywordFilter.Length` (e.g. 256 chars) —
|
|
a leading-wildcard `LIKE` of an unbounded pattern is itself an expensive operation
|
|
that runs under the same lock. Add a `Success: false, ErrorMessage: "PageSize
|
|
exceeds maximum"` reject path so a misbehaving central is told why its query is
|
|
refused.
|
|
|
|
### SiteEventLogging-018 — `FailedWriteCount` is exposed but never consumed by Health Monitoring
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Documentation & comments |
|
|
| Status | Open |
|
|
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:67-71,225-226` |
|
|
|
|
**Description**
|
|
|
|
`SiteEventLogger.FailedWriteCount` was added under SiteEventLogging-008 with the
|
|
XML doc statement "Surfaced so Health Monitoring can detect a logging outage
|
|
instead of relying on a local log line nobody is watching." The implementation is
|
|
correct (`Interlocked.Increment` on write failure, `Interlocked.Read` getter), but
|
|
a repo-wide search shows **no** caller anywhere in `src/` reads the property —
|
|
neither `ScadaLink.HealthMonitoring`, the central health collector, nor the host's
|
|
`/health` endpoint. The metric is dead-letter: a logging outage still goes
|
|
unnoticed in production, contradicting the original finding's resolution claim.
|
|
|
|
The property is also exposed only on the concrete `SiteEventLogger`, not on
|
|
`ISiteEventLogger`, so even if Health Monitoring were wired up it would have to
|
|
take a concrete-type dependency (`internal Connection` removed, but
|
|
`FailedWriteCount` remained concrete-only).
|
|
|
|
**Recommendation**
|
|
|
|
Either (a) wire `FailedWriteCount` into the existing Health Monitoring metric
|
|
pipeline (e.g. publish it alongside other 30-second-interval site metrics, and
|
|
promote a sustained non-zero value to a Warning), and add it to `ISiteEventLogger`
|
|
so the consumer doesn't downcast; or (b) acknowledge the metric is unobserved by
|
|
softening the XML doc to "Available for future Health Monitoring integration" and
|
|
file a tracking item for the wiring. The current doc claim is misleading.
|
|
|
|
### SiteEventLogging-019 — `EventLogPurgeService` runs on every host node; design says "active node"
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Design-document adherence |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:57-95`, `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:30-39`, `docs/requirements/Component-SiteEventLogging.md:45` |
|
|
|
|
**Description**
|
|
|
|
`AddSiteEventLogging` calls `services.AddHostedService<EventLogPurgeService>()`,
|
|
which registers the purge `BackgroundService` per host. On a 2-node site cluster
|
|
both `node-a` and `node-b` start the service independently, so each runs its own
|
|
30-day retention purge and 1 GB cap purge against its own local
|
|
`site_events.db`. The design states only "A daily background job runs on the
|
|
active node and deletes all events older than 30 days." (Component-SiteEventLogging,
|
|
Storage section). In practice the standby node receives no writes, so its purge
|
|
finds nothing to delete and is harmless — but the implementation does not match the
|
|
documented "active node" gating, and the resolution note on SiteEventLogging-004
|
|
already flagged that the *writer* runs on the standby too. The purge has the same
|
|
shape.
|
|
|
|
Aligning to the design is also a defence against a future change that does write
|
|
to the standby (e.g. local heartbeats), and removes the per-node wake-ups that
|
|
contribute to `Microsoft.Extensions.Hosting` shutdown latency.
|
|
|
|
**Recommendation**
|
|
|
|
Either (a) gate the purge service on "this node is the active member of `siteRole`"
|
|
(check the cluster singleton ownership before each `RunPurge()`, or host the
|
|
purge inside the same cluster singleton as `EventLogHandlerActor`), or (b) reword
|
|
the design doc to "the purge runs on every node against its own local database;
|
|
on the standby it is a no-op". Pick one; the current mismatch is a doc-vs-code
|
|
defect.
|
|
|
|
**Resolution (2026-05-28):** Took option (a) at the loop level — registration
|
|
stays unchanged on every host. Introduced a `SiteEventLogActiveNodeCheck`
|
|
delegate that `EventLogPurgeService` consults at the top of every
|
|
`RunPurge()` tick; standby returns early with a debug log. The DI factory
|
|
resolves the delegate from the container so the Host can register the real
|
|
check on a site node, and a null/unregistered delegate falls back to the
|
|
prior "always run" behaviour (backward compatible for non-clustered hosts and
|
|
existing tests). Defensive try/catch around the check defaults to "run" so a
|
|
transient cluster-state read failure cannot stop the purge loop. Added tests
|
|
`RunPurge_OnStandbyNode_SkipsAllWork`,
|
|
`RunPurge_OnActiveNode_RunsTheRetentionPurge`, and
|
|
`RunPurge_WithNullCheck_FallsBackToRunning`. Wiring the real check on the
|
|
Host's site-role branch is left for the Host's review. Tests green (50/50 in
|
|
SiteEventLogging.Tests).
|
|
|
|
### SiteEventLogging-020 — `severity` and `eventType` are unvalidated free-form strings; doc enumerates a set that is not enforced
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:144-156`, `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:14-15` |
|
|
|
|
**Resolution (2026-05-28):** `LogEventAsync` now validates `severity` against
|
|
the closed set `{Info, Warning, Error}` (case-sensitive, matches SQLite
|
|
BINARY collation used by the query filter) and throws `ArgumentException`
|
|
naming the offending value. `eventType` left intentionally free-form (design
|
|
enumerates an open category set). Tests: `LogEventAsync_ThrowsOnUnknownSeverity`
|
|
(5 cases) and `LogEventAsync_AcceptsAllDocumentedSeverities` (3 cases).
|
|
|
|
**Description**
|
|
|
|
`LogEventAsync` validates `eventType` and `severity` only for non-empty/non-whitespace.
|
|
The XML doc enumerates the allowed values: `eventType` ∈ {script, alarm,
|
|
deployment, connection, store_and_forward, instance_lifecycle}, `severity` ∈
|
|
{Info, Warning, Error}. Nothing in the code enforces either set. Any caller can
|
|
pass `"SCRIPT"`, `"Script"`, `"warn"`, `"ERR"`, or a typo and the row is inserted
|
|
verbatim. Two follow-on consequences:
|
|
|
|
1. The `EventLogQueryService.Severity` filter is `severity = $severity` (exact
|
|
match, case-sensitive by SQLite default `BINARY` collation). A row stored as
|
|
`"error"` will not be returned for a query filtering on `"Error"`. The design
|
|
lists severity as a first-class filter and the central UI will reasonably
|
|
normalise to one casing — every row stored with a different casing is silently
|
|
invisible to that filter.
|
|
2. The `Events Logged` table in the design implicitly relies on a stable
|
|
`event_type` enumeration to drive UI grouping; a typo'd `event_type` slips in
|
|
silently and is hard to detect later.
|
|
|
|
**Recommendation**
|
|
|
|
Validate `eventType` and `severity` against a known set (or accept `enum`s on the
|
|
interface, converting to canonical string at the call site). Reject unknown values
|
|
with `ArgumentException` and log a single-shot warning during construction if a
|
|
deployment is found to be using an unexpected value. Alternatively, normalise
|
|
casing (`severity = severity.ToLowerInvariant()`) so the query filter is
|
|
case-insensitive. Update the XML doc to match the enforced contract.
|
|
|
|
### SiteEventLogging-021 — `DateTimeOffset.Parse` uses the current culture; can throw on non-default locales
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:138` |
|
|
|
|
**Resolution (2026-05-28):** `EventLogQueryService.ExecuteQuery` now parses the
|
|
stored ISO 8601 timestamp with `CultureInfo.InvariantCulture` plus
|
|
`DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal`, so the
|
|
result is locale-independent and guaranteed UTC. No new test was added — the
|
|
recorder still writes via `DateTimeOffset.UtcNow.ToString("o")` which is itself
|
|
invariant-safe in practice, and the existing query/time-range tests continue
|
|
to pass.
|
|
|
|
**Description**
|
|
|
|
`ExecuteQuery` materialises rows via
|
|
`DateTimeOffset.Parse(reader.GetString(1))`. `DateTimeOffset.Parse(string)` uses
|
|
`CultureInfo.CurrentCulture` and `DateTimeStyles.None`. The stored format is ISO
|
|
8601 round-trip (`"o"`), which is *usually* parseable in any culture — but a
|
|
production node running with a non-default culture (e.g. Turkish "tr-TR", which
|
|
has historically broken case-insensitive ASCII comparisons via the
|
|
"Turkish-I" issue, or any culture that overrides the date/time separators) can
|
|
parse incorrectly or throw `FormatException`. The exception is caught by the outer
|
|
`try`, so the entire query is converted to a `Success: false` response — but the
|
|
failure mode is silent and culture-dependent.
|
|
|
|
The recorder side stores via `DateTimeOffset.UtcNow.ToString("o")`, which is also
|
|
culture-sensitive in the same way; on a hostile-culture node, the round-trip
|
|
between insert and query is not guaranteed to be lossless without explicit
|
|
culture pinning.
|
|
|
|
**Recommendation**
|
|
|
|
Parse with explicit invariant culture and round-trip style:
|
|
`DateTimeOffset.Parse(reader.GetString(1), CultureInfo.InvariantCulture,
|
|
DateTimeStyles.RoundtripKind)` (and the same for the `ToString("o", InvariantCulture)`
|
|
emitters in `SiteEventLogger.LogEventAsync` and `EventLogPurgeService.PurgeByRetention`).
|
|
Alternatively switch the schema to store `timestamp` as Unix-epoch `INTEGER` and
|
|
avoid all string-parsing.
|
|
|
|
### SiteEventLogging-022 — `Cache=Shared` is redundant for a single-connection logger
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Code organization & conventions |
|
|
| Status | Open |
|
|
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:52` |
|
|
|
|
**Description**
|
|
|
|
The connection string is built as
|
|
`$"Data Source={options.Value.DatabasePath};Cache=Shared"`. SQLite's
|
|
shared-cache mode is a *cross-connection* optimisation: it lets multiple
|
|
`SqliteConnection`s in the same process share an in-process page cache. This
|
|
logger owns exactly one `SqliteConnection` and serialises all access through
|
|
`_writeLock`, so `Cache=Shared` cannot share with anything — the mode is dormant.
|
|
At best it is dead configuration; at worst it adds (very small) per-statement
|
|
lock overhead inside SQLite. The sister `SqliteAuditWriter` carries the same
|
|
unused option, so the smell is a copy-and-paste pattern.
|
|
|
|
Shared-cache mode also subtly changes the semantics of `PRAGMA busy_timeout` and
|
|
`PRAGMA locking_mode`, so leaving it on while *not* using it is a small future-foot
|
|
gun if anyone later opens a second connection to the same file from another
|
|
component on the same host (e.g. a tooling read-only viewer).
|
|
|
|
**Recommendation**
|
|
|
|
Drop `Cache=Shared` from the connection string — the logger is single-connection
|
|
and gains nothing from it. If a future need to share the DB across connections in
|
|
the same process arises, reintroduce it deliberately together with the busy_timeout
|
|
and locking_mode review that should accompany it.
|
|
|
|
### SiteEventLogging-023 — Concurrent-stress test uses a non-volatile `stop` flag
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Testing coverage |
|
|
| Status | Open |
|
|
| Location | `tests/ScadaLink.SiteEventLogging.Tests/EventLogPurgeServiceTests.cs:282-308` |
|
|
|
|
**Description**
|
|
|
|
`PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` uses a plain `bool stop = false;`
|
|
that the main test thread mutates after the purge task completes
|
|
(`stop = true;`) while four background writer tasks are spin-checking `while (!stop)`.
|
|
The flag is not declared `volatile`, not wrapped in `Volatile.Read/Volatile.Write`,
|
|
and not behind a memory barrier. On a release build with a relaxed memory model
|
|
the writer threads are permitted to cache the `stop = false` read indefinitely,
|
|
which means in theory the test can hang past xUnit's per-test timeout instead of
|
|
asserting `Empty(exceptions)`. The test relies on observed JIT/runtime behaviour
|
|
that today happens to refresh the field across the `await _eventLogger.LogEventAsync`
|
|
boundary, but that is an implementation detail rather than a contract.
|
|
|
|
The test is a regression test for SiteEventLogging-003; a flaky / hang-prone
|
|
version of it can mask the very behaviour it is meant to pin.
|
|
|
|
**Recommendation**
|
|
|
|
Use a `CancellationTokenSource` (`while (!cts.IsCancellationRequested)`), or change
|
|
`stop` to a `volatile bool`, or use `Interlocked.Exchange` / `Volatile.Read`.
|
|
`CancellationTokenSource` is the canonical .NET pattern and also lets the test
|
|
cooperate with xUnit's `Task.WhenAll` timeout.
|