fix(alarm-historian): resolve High code-review findings (Core.AlarmHistorian-002, -004, -006)

Core.AlarmHistorian-002 — drain loop now honors exponential backoff:
StartDrainLoop arms a self-rescheduling one-shot Timer. RescheduleDrain
sets the next due-time to max(tickInterval, CurrentBackoff) while the
sink is BackingOff, so a historian outage genuinely slows the cadence
down the 1s->2s->5s->15s->60s ladder instead of hammering at the fixed
tick. Class doc-comment updated.

Core.AlarmHistorian-004 — SQLite busy handling: the connection string
is built via SqliteConnectionStringBuilder with DefaultTimeout=5, and a
new OpenConnection helper applies PRAGMA busy_timeout=5000 and
PRAGMA journal_mode=WAL on every open. A concurrent enqueue-vs-drain
file-lock collision now waits the lock out instead of failing fast with
SQLITE_BUSY. All connection open sites switched to the helper.

Core.AlarmHistorian-006 — drain-loop faults are no longer unobserved:
the timer callback (DrainTimerCallback) awaits DrainOnceAsync inside a
try/catch that logs via _logger.Error, records the message into
_lastError, and sets _drainState=BackingOff so a stalled drain is
visible on GetStatus; a finally always re-arms the timer.

Regression tests added to SqliteStoreAndForwardSinkTests:
StartDrainLoop_honors_backoff_and_slows_cadence_under_retry,
StartDrainLoop_keeps_steady_cadence_when_writer_is_healthy,
StartDrainLoop_records_drain_fault_and_keeps_running,
Concurrent_enqueue_and_drain_do_not_throw_sqlite_busy.

findings.md: 002/004/006 marked Resolved; open count 10 -> 7.

Build: clean (0 warnings). Tests: 20/20 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-22 06:19:55 -04:00
parent 6300a9e4a8
commit 4638366b77
3 changed files with 238 additions and 29 deletions

View File

@@ -7,7 +7,7 @@
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 10 |
| Open findings | 7 |
## Checklist coverage
@@ -48,13 +48,13 @@
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:99-105,386-388` |
| Status | Open |
| Status | Resolved |
**Description:** The class computes an exponential-backoff value (`_backoffIndex`, `BumpBackoff`, `CurrentBackoff`, the `BackoffLadder`) and the class doc-comment states "Drain runs on a shared `Timer`. Exponential backoff on `RetryPlease`: 1s → 2s → 5s → 15s → 60s cap." However `StartDrainLoop` creates the `Timer` with a fixed `tickInterval` for both due-time and period and never reschedules it. `CurrentBackoff` is computed but never consulted by the timer, so the drain loop keeps hammering the historian at the fixed cadence regardless of `BackingOff` state. The documented backoff behavior does not exist for the production drain path — it is only observable via the `CurrentBackoff` property in tests.
**Recommendation:** Make the drain loop honor the backoff. Either switch to a self-rescheduling one-shot timer that sets its next due-time to `max(tickInterval, CurrentBackoff)` after each `DrainOnceAsync`, or have `DrainOnceAsync` skip the writer call while still inside the backoff window (track `_nextEligibleDrainUtc`). Update the doc-comment if the design intentionally changes.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-22 — `StartDrainLoop` now arms a self-rescheduling one-shot `Timer`; `RescheduleDrain` sets the next due-time to `max(tickInterval, CurrentBackoff)` while `_drainState` is `BackingOff` so a historian outage genuinely slows the cadence down the ladder. Class doc-comment updated. Regression tests `StartDrainLoop_honors_backoff_and_slows_cadence_under_retry` and `StartDrainLoop_keeps_steady_cadence_when_writer_is_healthy` added.
### Core.AlarmHistorian-003
@@ -78,13 +78,13 @@
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:90,112,176,259` |
| Status | Open |
| Status | Resolved |
**Description:** Every operation opens a brand-new `SqliteConnection` from the bare connection string `Data Source={databasePath}` — no `busy_timeout` / `Pragma`, no shared cache. SQLite serializes writers with a file lock; when `EnqueueAsync` (emitting thread) and `DrainOnceAsync` (timer thread) collide, the loser gets an immediate `SQLITE_BUSY` exception because the default busy timeout is 0. In `DrainOnceAsync` the `BeginTransaction()` / `Commit()` block can fail mid-drain with `SQLITE_BUSY`; the exception escapes the `try` (it is not the writer-call `try`), the `finally` releases the gate, and the row outcomes are lost / partially applied. The class doc-comment claims `DrainOnceAsync` is "Safe to call from multiple threads" but the concurrent enqueue-vs-drain case is not actually safe against busy errors.
**Recommendation:** Configure a non-zero busy timeout — `SqliteConnectionStringBuilder { DataSource = databasePath, DefaultTimeout = 5 }` plus `PRAGMA busy_timeout=5000` on open. Strongly consider WAL journal mode (`PRAGMA journal_mode=WAL`) so readers and the writer do not block each other. Reuse a single long-lived write connection guarded by `_drainGate` rather than opening/closing per call.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-22 — the connection string is now built via `SqliteConnectionStringBuilder` with `DefaultTimeout = 5`, and every connection is opened through a new `OpenConnection` helper that applies `PRAGMA busy_timeout=5000` and `PRAGMA journal_mode=WAL` so an enqueue/drain lock collision waits the lock out instead of throwing `SQLITE_BUSY`. All eight call sites switched to the helper. Regression test `Concurrent_enqueue_and_drain_do_not_throw_sqlite_busy` added.
### Core.AlarmHistorian-005
@@ -108,13 +108,13 @@
| Severity | High |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:103,135-216` |
| Status | Open |
| Status | Resolved |
**Description:** `StartDrainLoop` launches the drain with `new Timer(_ => _ = DrainOnceAsync(CancellationToken.None), ...)`. The returned `Task` is discarded (`_ =`), so any exception thrown by `DrainOnceAsync` is an unobserved task exception — never logged, never surfaced. Several paths in `DrainOnceAsync` can throw: the `outcomes.Count != events.Count` guard (`InvalidOperationException`), `JsonSerializer.Deserialize` on a malformed payload, `PurgeAgedDeadLetters` / `ReadBatch` / the commit block hitting `SQLITE_BUSY` or a schema error. When any of these throw, the drain silently stops making progress for that tick, `_drainState` is left stale (still `Draining`), and an operator watching the Admin UI sees no error. A persistently failing condition produces a silent, permanently stalled queue.
**Recommendation:** Wrap the timer callback body in a `try/catch` that logs the exception via `_logger.Error`, records it into `_lastError`, and resets `_drainState` so the diagnostics surface reflects the failure. Do not discard the `Task` without an attached continuation that observes faults.
**Resolution:** _(open)_
**Resolution:** Resolved 2026-05-22 — the timer no longer discards the drain `Task`. A dedicated `DrainTimerCallback` `await`s `DrainOnceAsync` inside a `try/catch` that logs the fault via `_logger.Error`, records it into `_lastError`, and sets `_drainState = BackingOff` so the failure is visible on the `GetStatus` surface; a `finally` always re-arms the timer so a faulting tick can never permanently stall the queue. Regression test `StartDrainLoop_records_drain_fault_and_keeps_running` added.
### Core.AlarmHistorian-007