- Core.AlarmHistorian-008: cache queue depth in an Interlocked counter so
EnqueueAsync no longer runs COUNT(*) on every alarm; consolidate
DrainOnceAsync onto a single SqliteConnection per tick (purge, batch
read, dead-letter, and outcome transaction all share it).
- Core.AlarmHistorian-011: confirm the stale Galaxy.Host XML doc
references were already fixed under earlier commits; flip to Resolved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test suite lacked coverage for four critical paths: corrupt/null-
deserializing PayloadJson rows, StartDrainLoop timer behavior and backoff
honoring, concurrent EnqueueAsync+DrainOnceAsync stress, and the
outcomes.Count != events.Count cardinality-mismatch branch.
Added tests covering all four gaps (committed across companion findings):
- Drain_with_corrupt_payload_row_deadletters_it_and_keeps_good_rows_aligned
- Drain_with_corrupt_head_row_does_not_stall_queue
- StartDrainLoop_honors_backoff_and_slows_cadence_under_retry
- StartDrainLoop_keeps_steady_cadence_when_writer_is_healthy
- StartDrainLoop_records_drain_fault_and_keeps_running
- Concurrent_enqueue_and_drain_do_not_throw_sqlite_busy
- Writer_returning_wrong_cardinality_outcomes_sets_backing_off_and_keeps_rows
- Capacity_eviction_increments_evicted_count_on_status
- GetStatus_snapshot_is_consistent_under_concurrent_drain
Updated Open findings count to 2 (Core.AlarmHistorian-008 + -011, both Low).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add #pragma warning disable xUnit1051 at the top of ContractsWireParityTests.cs.
The xUnit1051 analyser fires on MessagePack's Serialize/Deserialize overloads that
have an optional CancellationToken parameter; these are synchronous parity tests
where the token is not meaningful — the suppression is scoped to this file only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When WriteBatchAsync returned a wrong-cardinality outcome list, DrainOnceAsync
threw InvalidOperationException after potential delivery — causing duplicate
events on re-drain or permanent queue stall on a deterministic writer bug.
- The throw replaced with log + backoff: mismatch is recorded into _lastError,
_drainState set to BackingOff, backoff bumped, method returns without applying
any outcomes, mirroring the writer-exception path.
- Regression test Writer_returning_wrong_cardinality_outcomes_sets_backing_off_and_keeps_rows
asserts rows stay queued, DrainState = BackingOff, LastError populated, and
that a fixed writer subsequently drains cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Status fields (_lastDrainUtc, _lastSuccessUtc, _lastError, _drainState,
_evictedCount) were written by the drain timer thread and read by
GetStatus() / health-check threads with no memory barrier, risking torn
DateTime? reads and stale DrainState observations.
- Added _statusLock object; all writes to status fields now happen inside
lock(_statusLock) blocks in DrainOnceAsync and DrainTimerCallback.
- GetStatus() snapshots all fields atomically under the same lock so the
Admin UI / /healthz endpoint always sees a consistent view.
- Regression test GetStatus_snapshot_is_consistent_under_concurrent_drain
drives status writes and reads from concurrent threads; asserts no throws.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EnqueueAsync used synchronous SQLite I/O (conn.Open / ExecuteNonQuery /
COUNT(*)) on the caller's thread, blocking the alarm-emitting thread under
write contention with the drain worker. The cancellationToken parameter was
silently ignored.
- EnqueueAsync converted to genuine async: OpenAsync / ExecuteNonQueryAsync /
ExecuteScalarAsync used throughout; ct threaded to every await.
- ApplyPragmasAsync added alongside the existing ApplyPragmas helper so
the WAL + busy_timeout PRAGMAs are applied on the async open path too.
- EnforceCapacityAsync added to handle capacity eviction on the async path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Core.AlarmHistorian-002 — drain loop now honors exponential backoff:
StartDrainLoop arms a self-rescheduling one-shot Timer. RescheduleDrain
sets the next due-time to max(tickInterval, CurrentBackoff) while the
sink is BackingOff, so a historian outage genuinely slows the cadence
down the 1s->2s->5s->15s->60s ladder instead of hammering at the fixed
tick. Class doc-comment updated.
Core.AlarmHistorian-004 — SQLite busy handling: the connection string
is built via SqliteConnectionStringBuilder with DefaultTimeout=5, and a
new OpenConnection helper applies PRAGMA busy_timeout=5000 and
PRAGMA journal_mode=WAL on every open. A concurrent enqueue-vs-drain
file-lock collision now waits the lock out instead of failing fast with
SQLITE_BUSY. All connection open sites switched to the helper.
Core.AlarmHistorian-006 — drain-loop faults are no longer unobserved:
the timer callback (DrainTimerCallback) awaits DrainOnceAsync inside a
try/catch that logs via _logger.Error, records the message into
_lastError, and sets _drainState=BackingOff so a stalled drain is
visible on GetStatus; a finally always re-arms the timer.
Regression tests added to SqliteStoreAndForwardSinkTests:
StartDrainLoop_honors_backoff_and_slows_cadence_under_retry,
StartDrainLoop_keeps_steady_cadence_when_writer_is_healthy,
StartDrainLoop_records_drain_fault_and_keeps_running,
Concurrent_enqueue_and_drain_do_not_throw_sqlite_busy.
findings.md: 002/004/006 marked Resolved; open count 10 -> 7.
Build: clean (0 warnings). Tests: 20/20 passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ReadBatch built parallel rowIds / events lists: rowIds.Add ran for every
row but events.Add was guarded by `if (evt is not null)`. A corrupt /
null-deserializing payload desynced the lists, so DrainOnceAsync applied
each outcome to the wrong RowId — an Ack could delete an un-sent event
(silent alarm-event data loss) and the corrupt row stalled the queue
head forever.
ReadBatch now returns a single list of QueueRow(long RowId,
AlarmHistorianEvent? Event) records so a rowId can never drift from its
event; deserialization is wrapped to yield null on JsonException.
DrainOnceAsync immediately dead-letters rows whose payload is
null/un-deserializable and forwards only well-formed events to the
writer, mapping outcomes by RowId.
Regression tests cover a corrupt row mid-batch and at the queue head.
Core.AlarmHistorian suite: 16/16 pass.
Resolves code-review finding Core.AlarmHistorian-001 (Critical).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewed all 31 src/ production projects against the 10-category
checklist in REVIEW-PROCESS.md. Each module gets its own findings.md;
code-reviews/README.md is regenerated from them.
334 findings: 6 Critical, 46 High, 126 Medium, 156 Low.
Critical findings:
- Server-001: WriteNodeIdUnknown recurses unconditionally — a HistoryRead
on an unresolvable node crashes the process (remote DoS).
- Admin-001/002: app-wide auth bypass (RouteView not AuthorizeRouteView)
plus unauthenticated mutating routes.
- Core.Scripting-001: System.Environment reachable from operator scripts;
Environment.Exit() terminates the server.
- Core.AlarmHistorian-001: rowIds/events parallel-list desync on a corrupt
payload misapplies outcomes — silent alarm-event data loss.
- Driver.Galaxy-001: ReconnectSupervisor is built but never triggered, so
a transient gateway drop permanently kills the event stream.
All findings are Status=Open; resolution is tracked per REVIEW-PROCESS.md
section 4. Review only — no source code changed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>