Commit Graph

9 Commits

Author SHA1 Message Date
Joseph Doherty
0da4f3b63a fix(core-alarm-historian): resolve Low code-review findings (Core.AlarmHistorian-008,011)
- Core.AlarmHistorian-008: cache queue depth in an Interlocked counter so
  EnqueueAsync no longer runs COUNT(*) on every alarm; consolidate
  DrainOnceAsync onto a single SqliteConnection per tick (purge, batch
  read, dead-letter, and outcome transaction all share it).
- Core.AlarmHistorian-011: confirm the stale Galaxy.Host XML doc
  references were already fixed under earlier commits; flip to Resolved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 05:38:26 -04:00
Joseph Doherty
cec7ab6ec4 fix(alarm-historian): resolve Medium code-review finding (Core.AlarmHistorian-010)
The test suite lacked coverage for four critical paths: corrupt/null-
deserializing PayloadJson rows, StartDrainLoop timer behavior and backoff
honoring, concurrent EnqueueAsync+DrainOnceAsync stress, and the
outcomes.Count != events.Count cardinality-mismatch branch.

Added tests covering all four gaps (committed across companion findings):
- Drain_with_corrupt_payload_row_deadletters_it_and_keeps_good_rows_aligned
- Drain_with_corrupt_head_row_does_not_stall_queue
- StartDrainLoop_honors_backoff_and_slows_cadence_under_retry
- StartDrainLoop_keeps_steady_cadence_when_writer_is_healthy
- StartDrainLoop_records_drain_fault_and_keeps_running
- Concurrent_enqueue_and_drain_do_not_throw_sqlite_busy
- Writer_returning_wrong_cardinality_outcomes_sets_backing_off_and_keeps_rows
- Capacity_eviction_increments_evicted_count_on_status
- GetStatus_snapshot_is_consistent_under_concurrent_drain

Updated Open findings count to 2 (Core.AlarmHistorian-008 + -011, both Low).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:30:17 -04:00
Joseph Doherty
f6d487b167 fix(driver-historian-wonderware-client): suppress xUnit1051 false-positive in ContractsWireParityTests
Add #pragma warning disable xUnit1051 at the top of ContractsWireParityTests.cs.
The xUnit1051 analyser fires on MessagePack's Serialize/Deserialize overloads that
have an optional CancellationToken parameter; these are synchronous parity tests
where the token is not meaningful — the suppression is scoped to this file only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:28:20 -04:00
Joseph Doherty
5718cb5778 fix(alarm-historian): resolve Medium code-review finding (Core.AlarmHistorian-007)
When WriteBatchAsync returned a wrong-cardinality outcome list, DrainOnceAsync
threw InvalidOperationException after potential delivery — causing duplicate
events on re-drain or permanent queue stall on a deterministic writer bug.

- The throw replaced with log + backoff: mismatch is recorded into _lastError,
  _drainState set to BackingOff, backoff bumped, method returns without applying
  any outcomes, mirroring the writer-exception path.
- Regression test Writer_returning_wrong_cardinality_outcomes_sets_backing_off_and_keeps_rows
  asserts rows stay queued, DrainState = BackingOff, LastError populated, and
  that a fixed writer subsequently drains cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:27:55 -04:00
Joseph Doherty
6d520c6756 fix(alarm-historian): resolve Medium code-review finding (Core.AlarmHistorian-005)
Status fields (_lastDrainUtc, _lastSuccessUtc, _lastError, _drainState,
_evictedCount) were written by the drain timer thread and read by
GetStatus() / health-check threads with no memory barrier, risking torn
DateTime? reads and stale DrainState observations.

- Added _statusLock object; all writes to status fields now happen inside
  lock(_statusLock) blocks in DrainOnceAsync and DrainTimerCallback.
- GetStatus() snapshots all fields atomically under the same lock so the
  Admin UI / /healthz endpoint always sees a consistent view.
- Regression test GetStatus_snapshot_is_consistent_under_concurrent_drain
  drives status writes and reads from concurrent threads; asserts no throws.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:27:31 -04:00
Joseph Doherty
0cc3b23101 fix(alarm-historian): resolve Medium code-review finding (Core.AlarmHistorian-003)
EnqueueAsync used synchronous SQLite I/O (conn.Open / ExecuteNonQuery /
COUNT(*)) on the caller's thread, blocking the alarm-emitting thread under
write contention with the drain worker. The cancellationToken parameter was
silently ignored.

- EnqueueAsync converted to genuine async: OpenAsync / ExecuteNonQueryAsync /
  ExecuteScalarAsync used throughout; ct threaded to every await.
- ApplyPragmasAsync added alongside the existing ApplyPragmas helper so
  the WAL + busy_timeout PRAGMAs are applied on the async open path too.
- EnforceCapacityAsync added to handle capacity eviction on the async path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:23:14 -04:00
Joseph Doherty
4638366b77 fix(alarm-historian): resolve High code-review findings (Core.AlarmHistorian-002, -004, -006)
Core.AlarmHistorian-002 — drain loop now honors exponential backoff:
StartDrainLoop arms a self-rescheduling one-shot Timer. RescheduleDrain
sets the next due-time to max(tickInterval, CurrentBackoff) while the
sink is BackingOff, so a historian outage genuinely slows the cadence
down the 1s->2s->5s->15s->60s ladder instead of hammering at the fixed
tick. Class doc-comment updated.

Core.AlarmHistorian-004 — SQLite busy handling: the connection string
is built via SqliteConnectionStringBuilder with DefaultTimeout=5, and a
new OpenConnection helper applies PRAGMA busy_timeout=5000 and
PRAGMA journal_mode=WAL on every open. A concurrent enqueue-vs-drain
file-lock collision now waits the lock out instead of failing fast with
SQLITE_BUSY. All connection open sites switched to the helper.

Core.AlarmHistorian-006 — drain-loop faults are no longer unobserved:
the timer callback (DrainTimerCallback) awaits DrainOnceAsync inside a
try/catch that logs via _logger.Error, records the message into
_lastError, and sets _drainState=BackingOff so a stalled drain is
visible on GetStatus; a finally always re-arms the timer.

Regression tests added to SqliteStoreAndForwardSinkTests:
StartDrainLoop_honors_backoff_and_slows_cadence_under_retry,
StartDrainLoop_keeps_steady_cadence_when_writer_is_healthy,
StartDrainLoop_records_drain_fault_and_keeps_running,
Concurrent_enqueue_and_drain_do_not_throw_sqlite_busy.

findings.md: 002/004/006 marked Resolved; open count 10 -> 7.

Build: clean (0 warnings). Tests: 20/20 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:27:39 -04:00
Joseph Doherty
796871c210 fix(alarm-historian): keep queue rows aligned to events on drain (Core.AlarmHistorian-001)
ReadBatch built parallel rowIds / events lists: rowIds.Add ran for every
row but events.Add was guarded by `if (evt is not null)`. A corrupt /
null-deserializing payload desynced the lists, so DrainOnceAsync applied
each outcome to the wrong RowId — an Ack could delete an un-sent event
(silent alarm-event data loss) and the corrupt row stalled the queue
head forever.

ReadBatch now returns a single list of QueueRow(long RowId,
AlarmHistorianEvent? Event) records so a rowId can never drift from its
event; deserialization is wrapped to yield null on JsonException.
DrainOnceAsync immediately dead-letters rows whose payload is
null/un-deserializable and forwards only well-formed events to the
writer, mapping outcomes by RowId.

Regression tests cover a corrupt row mid-batch and at the queue head.
Core.AlarmHistorian suite: 16/16 pass.

Resolves code-review finding Core.AlarmHistorian-001 (Critical).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 05:54:20 -04:00
Joseph Doherty
8568f5cd85 docs(code-reviews): comprehensive per-module review pass at 76d35d1
Reviewed all 31 src/ production projects against the 10-category
checklist in REVIEW-PROCESS.md. Each module gets its own findings.md;
code-reviews/README.md is regenerated from them.

334 findings: 6 Critical, 46 High, 126 Medium, 156 Low.

Critical findings:
- Server-001: WriteNodeIdUnknown recurses unconditionally — a HistoryRead
  on an unresolvable node crashes the process (remote DoS).
- Admin-001/002: app-wide auth bypass (RouteView not AuthorizeRouteView)
  plus unauthenticated mutating routes.
- Core.Scripting-001: System.Environment reachable from operator scripts;
  Environment.Exit() terminates the server.
- Core.AlarmHistorian-001: rowIds/events parallel-list desync on a corrupt
  payload misapplies outcomes — silent alarm-event data loss.
- Driver.Galaxy-001: ReconnectSupervisor is built but never triggered, so
  a transient gateway drop permanently kills the event stream.

All findings are Status=Open; resolution is tracked per REVIEW-PROCESS.md
section 4. Review only — no source code changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 05:20:27 -04:00