lmxopcua/code-reviews/Core.AlarmHistorian/findings.md

# Code Review — Core.AlarmHistorian

| Field | Value |
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 0 |

## Checklist coverage

| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core.AlarmHistorian-001, Core.AlarmHistorian-002 |
| 2 | OtOpcUa conventions | Core.AlarmHistorian-003 |
| 3 | Concurrency & thread safety | Core.AlarmHistorian-004, Core.AlarmHistorian-005 |
| 4 | Error handling & resilience | Core.AlarmHistorian-006, Core.AlarmHistorian-007 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Core.AlarmHistorian-008 |
| 7 | Design-document adherence | Core.AlarmHistorian-009 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Core.AlarmHistorian-010 |
| 10 | Documentation & comments | Core.AlarmHistorian-011 |

## Findings

### Core.AlarmHistorian-001

| Field | Value |
|---|---|
| Severity | Critical |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:255-278` |
| Status | Resolved |

**Description:** `ReadBatch` builds two parallel lists, `rowIds` and `events`, that `DrainOnceAsync` later indexes together (`rowIds[i]` paired with `outcomes[i]`, where `outcomes` is 1:1 with `events`). But `rowIds.Add(reader.GetInt64(0))` runs unconditionally for every row, while `events.Add(evt)` is guarded by `if (evt is not null)`. If `JsonSerializer.Deserialize<AlarmHistorianEvent>` returns `null` for any row (corrupt or empty payload), `rowIds` gains an entry but `events` does not. The writer then returns `outcomes.Count == events.Count`, which passes the `outcomes.Count != events.Count` guard, and the per-row loop applies each outcome to `rowIds[i]` — every row from the skipped index onward is mapped to the wrong event's outcome. An `Ack` can delete a row whose event was never sent to the historian (silent alarm-event data loss), and a `PermanentFail` can dead-letter an unrelated good row. The corrupt row itself is never advanced and is re-read on every drain forever, permanently stalling the queue head.

**Recommendation:** Keep `rowIds` and `events` strictly aligned. Either skip the `rowId` when deserialization returns `null`, or — better — treat a `null`/failed deserialization as an immediate dead-letter for that specific `RowId` (it can never succeed) and exclude it from the batch passed to the writer. Carry the `rowId` inside a single list of `(long RowId, AlarmHistorianEvent Event)` tuples so the two can never drift.

**Resolution:** Resolved 2026-05-22 — `ReadBatch` now returns a single list of `QueueRow(long RowId, AlarmHistorianEvent? Event)` records so a rowId can never drift from its event; `DrainOnceAsync` immediately dead-letters rows whose payload is null/un-deserializable (also catching `JsonException`) and forwards only well-formed events to the writer, mapping outcomes by `liveRows[i].RowId`. Regression tests `Drain_with_corrupt_payload_row_deadletters_it_and_keeps_good_rows_aligned` and `Drain_with_corrupt_head_row_does_not_stall_queue` added.

### Core.AlarmHistorian-002

| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:99-105,386-388` |
| Status | Resolved |

**Description:** The class computes an exponential-backoff value (`_backoffIndex`, `BumpBackoff`, `CurrentBackoff`, the `BackoffLadder`) and the class doc-comment states "Drain runs on a shared `Timer`. Exponential backoff on `RetryPlease`: 1s → 2s → 5s → 15s → 60s cap." However `StartDrainLoop` creates the `Timer` with a fixed `tickInterval` for both due-time and period and never reschedules it. `CurrentBackoff` is computed but never consulted by the timer, so the drain loop keeps hammering the historian at the fixed cadence regardless of `BackingOff` state. The documented backoff behavior does not exist for the production drain path — it is only observable via the `CurrentBackoff` property in tests.

**Recommendation:** Make the drain loop honor the backoff. Either switch to a self-rescheduling one-shot timer that sets its next due-time to `max(tickInterval, CurrentBackoff)` after each `DrainOnceAsync`, or have `DrainOnceAsync` skip the writer call while still inside the backoff window (track `_nextEligibleDrainUtc`). Update the doc-comment if the design intentionally changes.

**Resolution:** Resolved 2026-05-22 — `StartDrainLoop` now arms a self-rescheduling one-shot `Timer`; `RescheduleDrain` sets the next due-time to `max(tickInterval, CurrentBackoff)` while `_drainState` is `BackingOff` so a historian outage genuinely slows the cadence down the ladder. Class doc-comment updated. Regression tests `StartDrainLoop_honors_backoff_and_slows_cadence_under_retry` and `StartDrainLoop_keeps_steady_cadence_when_writer_is_healthy` added.

### Core.AlarmHistorian-003

| Field | Value |
|---|---|
| Severity | Medium |
| Category | OtOpcUa conventions |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:107-127,218-243,246-253` |
| Status | Resolved |

**Description:** `EnqueueAsync` is declared `async`-shaped (`Task EnqueueAsync(...)`) and the `IAlarmHistorianSink` contract explicitly states "the sink MUST NOT block the emitting thread … `EnqueueAsync` returns as soon as the queue row is committed." But the implementation does fully synchronous, blocking SQLite I/O (`conn.Open()`, `EnforceCapacity`, `cmd.ExecuteNonQuery()`) on the caller's thread and only then returns `Task.CompletedTask`. Under SQLite write contention with the drain worker this blocks the alarm-emitting thread for the full lock-wait. The same synchronous-work-behind-an-async-or-status-API pattern applies to `GetStatus` (called from the Admin UI / `/healthz` request thread) and `RetryDeadLettered`. The `cancellationToken` parameter of `EnqueueAsync` is accepted and ignored.

**Recommendation:** Either make the I/O genuinely asynchronous (`await conn.OpenAsync(ct)`, `await cmd.ExecuteNonQueryAsync(ct)` — `Microsoft.Data.Sqlite` supports the async surface), or change `EnqueueAsync` to an in-memory hand-off (e.g. a `Channel`) drained by a background writer so the emitting thread truly never touches the database. At minimum honor the `cancellationToken` parameter.

**Resolution:** Resolved 2026-05-22 — `EnqueueAsync` now uses `OpenAsync` / `ExecuteNonQueryAsync` / `ExecuteScalarAsync` throughout (capacity check included); `ApplyPragmasAsync` handles the WAL/busy-timeout PRAGMA on the async path; `cancellationToken` is threaded through every await so cancellation is honoured.

### Core.AlarmHistorian-004

| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:90,112,176,259` |
| Status | Resolved |

**Description:** Every operation opens a brand-new `SqliteConnection` from the bare connection string `Data Source={databasePath}` — no `busy_timeout` / `Pragma`, no shared cache. SQLite serializes writers with a file lock; when `EnqueueAsync` (emitting thread) and `DrainOnceAsync` (timer thread) collide, the loser gets an immediate `SQLITE_BUSY` exception because the default busy timeout is 0. In `DrainOnceAsync` the `BeginTransaction()` / `Commit()` block can fail mid-drain with `SQLITE_BUSY`; the exception escapes the `try` (it is not the writer-call `try`), the `finally` releases the gate, and the row outcomes are lost / partially applied. The class doc-comment claims `DrainOnceAsync` is "Safe to call from multiple threads" but the concurrent enqueue-vs-drain case is not actually safe against busy errors.

**Recommendation:** Configure a non-zero busy timeout — `SqliteConnectionStringBuilder { DataSource = databasePath, DefaultTimeout = 5 }` plus `PRAGMA busy_timeout=5000` on open. Strongly consider WAL journal mode (`PRAGMA journal_mode=WAL`) so readers and the writer do not block each other. Reuse a single long-lived write connection guarded by `_drainGate` rather than opening/closing per call.

**Resolution:** Resolved 2026-05-22 — the connection string is now built via `SqliteConnectionStringBuilder` with `DefaultTimeout = 5`, and every connection is opened through a new `OpenConnection` helper that applies `PRAGMA busy_timeout=5000` and `PRAGMA journal_mode=WAL` so an enqueue/drain lock collision waits the lock out instead of throwing `SQLITE_BUSY`. All eight call sites switched to the helper. Regression test `Concurrent_enqueue_and_drain_do_not_throw_sqlite_busy` added.

### Core.AlarmHistorian-005

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:66-71,141-143,199,386-388` |
| Status | Resolved |

**Description:** The mutable status fields `_lastDrainUtc`, `_lastSuccessUtc`, `_lastError`, `_drainState`, and `_backoffIndex` are written by the drain timer thread inside `DrainOnceAsync` and read concurrently by `GetStatus()` / `CurrentBackoff` on Admin-UI / health-check threads with no memory barrier (no `lock`, no `volatile`, no `Interlocked`). `DateTime?` is not guaranteed to be written atomically, and the reader can observe a stale or torn value. This is a diagnostics surface so the impact is limited, but a torn `DateTime?` read is real undefined behavior.

**Recommendation:** Guard the status fields with a small lock, or make the scalars `volatile` where the type permits and snapshot `DateTime?` values under a lock. Take the snapshot atomically in `GetStatus()`.

**Resolution:** Resolved 2026-05-22 — added `_statusLock` object; all writes to `_lastDrainUtc`, `_lastSuccessUtc`, `_lastError`, `_drainState`, and `_evictedCount` (new) now happen inside `lock (_statusLock)` blocks; `GetStatus()` snapshots all fields atomically under the same lock. Regression test `GetStatus_snapshot_is_consistent_under_concurrent_drain` added.

### Core.AlarmHistorian-006

| Field | Value |
|---|---|
| Severity | High |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:103,135-216` |
| Status | Resolved |

**Description:** `StartDrainLoop` launches the drain with `new Timer(_ => _ = DrainOnceAsync(CancellationToken.None), ...)`. The returned `Task` is discarded (`_ =`), so any exception thrown by `DrainOnceAsync` is an unobserved task exception — never logged, never surfaced. Several paths in `DrainOnceAsync` can throw: the `outcomes.Count != events.Count` guard (`InvalidOperationException`), `JsonSerializer.Deserialize` on a malformed payload, `PurgeAgedDeadLetters` / `ReadBatch` / the commit block hitting `SQLITE_BUSY` or a schema error. When any of these throw, the drain silently stops making progress for that tick, `_drainState` is left stale (still `Draining`), and an operator watching the Admin UI sees no error. A persistently failing condition produces a silent, permanently stalled queue.

**Recommendation:** Wrap the timer callback body in a `try/catch` that logs the exception via `_logger.Error`, records it into `_lastError`, and resets `_drainState` so the diagnostics surface reflects the failure. Do not discard the `Task` without an attached continuation that observes faults.

**Resolution:** Resolved 2026-05-22 — the timer no longer discards the drain `Task`. A dedicated `DrainTimerCallback` `await`s `DrainOnceAsync` inside a `try/catch` that logs the fault via `_logger.Error`, records it into `_lastError`, and sets `_drainState = BackingOff` so the failure is visible on the `GetStatus` surface; a `finally` always re-arms the timer so a faulting tick can never permanently stall the queue. Regression test `StartDrainLoop_records_drain_fault_and_keeps_running` added.

### Core.AlarmHistorian-007

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:172-174` |
| Status | Resolved |

**Description:** When the writer returns a wrong-cardinality result, the code throws `InvalidOperationException` after `WriteBatchAsync` has already succeeded. The events were potentially delivered to the historian, but no rows are deleted or dead-lettered, `_drainState` is left at `Draining`, and the backoff is not bumped. Combined with Core.AlarmHistorian-006 the exception is then swallowed. On the next drain the same batch is re-sent — if the writer actually delivered the events the first time, this produces duplicate historian rows; if it is a deterministic writer bug the queue stalls forever.

**Recommendation:** Treat a cardinality mismatch as a transient batch failure: log it, set `_lastError`, bump backoff, set `_drainState = BackingOff`, and return without throwing — mirroring the writer-exception path at lines 162-170. A deterministic writer contract violation should also raise an operator-visible alert rather than silently looping.

**Resolution:** Resolved 2026-05-22 — the `throw InvalidOperationException` replaced with log-and-backoff: mismatch is recorded into `_lastError`, `_drainState = BackingOff`, backoff is bumped, and the method returns without applying any outcomes — rows stay queued for the next drain attempt. Regression test `Writer_returning_wrong_cardinality_outcomes_sets_backing_off_and_keeps_rows` added.

### Core.AlarmHistorian-008

| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:107-127,255-278` |
| Status | Resolved |

**Description:** Each `EnqueueAsync` (one per alarm transition — a hot path on a busy plant) opens a connection, runs `EnforceCapacity` (a `COUNT(*)` over the queue table on every single enqueue), serializes JSON, inserts, and closes the connection. The unconditional `COUNT(*)` on every enqueue is an avoidable scan; the open/close churn defeats connection pooling benefits and adds lock-acquisition overhead per event. `DrainOnceAsync` similarly opens three separate connections per tick (`PurgeAgedDeadLetters`, `ReadBatch`, the transaction block).

**Recommendation:** Reuse a single pooled write connection. Replace the per-enqueue `COUNT(*)` with a periodic capacity check (every Nth enqueue, or piggy-backed on the drain tick), or maintain an in-memory approximate counter. Combine the drain-tick connections into one.

**Resolution:** Resolved 2026-05-23 — added an `Interlocked`-guarded in-memory `_queuedRowCount` seeded from storage at construction and kept current by every mutation (enqueue increment, drain Ack/PermanentFail/corrupt-dead-letter decrements, capacity-eviction adjustment, RetryDeadLettered re-add). `EnqueueAsync` now short-circuits capacity enforcement against the cached counter via `EnforceCapacityFastPathAsync`, only paying for a real `COUNT(*)` when the cached value reaches the capacity wall or the periodic resync interval (every 10,000 enqueues) elapses; the obsolete sync `EnforceCapacity` was removed. `GetStatus()` reads `QueueDepth` from the same counter so a busy Admin UI no longer hits the DB for it. `DrainOnceAsync` is consolidated onto one shared `SqliteConnection` per tick — purge, read, corrupt-dead-letter, and the outcome-applying transaction now reuse it instead of opening three. Regression tests `EnqueueAsync_does_not_count_all_rows_on_every_call_below_capacity`, `Enqueue_and_drain_keep_queue_depth_consistent_with_storage`, and `Counter_remains_consistent_under_concurrent_enqueue_and_drain` added.

### Core.AlarmHistorian-009

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Design-document adherence |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:317-347` |
| Status | Resolved |

**Description:** `docs/AlarmTracking.md` and the `IAlarmHistorianSink` contract present the SQLite queue as the durability guarantee — "Durably enqueue the event", "operator acks never block on the historian being reachable". But `EnforceCapacity` silently deletes the oldest non-dead-lettered (not-yet-sent) rows when the queue reaches `DefaultCapacity` (1,000,000). Those are alarm-event records that were accepted as durably queued and are then dropped before ever reaching the historian — silent alarm-history data loss under sustained historian outage. The only signal is a `WARN` log line. Neither `docs/AlarmTracking.md` nor the sink's XML doc mentions that the durability guarantee is bounded, and there is no metric/dead-letter trail for evicted rows.

**Recommendation:** At minimum document the bounded-durability behavior in `docs/AlarmTracking.md` and the `IAlarmHistorianSink` summary. Better: surface evicted-row counts in `HistorianSinkStatus` (a dedicated counter) so the loss is operator-visible, and consider routing overflow to the dead-letter table instead of hard-deleting it so the records survive for post-mortem within the retention window.

**Resolution:** Resolved 2026-05-22 — added `EvictedCount` (default 0) to `HistorianSinkStatus` with full param-tag documentation; `EnforceCapacity` and `EnforceCapacityAsync` now increment `_evictedCount` (guarded by `_statusLock`) and include the lifetime total in the WARN log; `docs/AlarmTracking.md` documents the bounded-durability caveat and the `EvictedCount` surface. Regression test `Capacity_eviction_increments_evicted_count_on_status` added.

### Core.AlarmHistorian-010

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian.Tests/SqliteStoreAndForwardSinkTests.cs` |
| Status | Resolved |

**Description:** The test suite covers the happy paths well (Ack/Retry/PermanentFail, capacity eviction, retention purge, ctor validation) but leaves critical paths untested: (a) no test exercises a corrupt / `null`-deserializing `PayloadJson` row, so the `rowIds`/`events` misalignment bug (Core.AlarmHistorian-001) was not caught; (b) no test for `StartDrainLoop` actually running on the timer, nor for the backoff being honored by the schedule (Core.AlarmHistorian-002); (c) no concurrency test running `EnqueueAsync` and `DrainOnceAsync` in parallel, which is the exact scenario that triggers `SQLITE_BUSY` (Core.AlarmHistorian-004); (d) no test for the `outcomes.Count != events.Count` cardinality-mismatch branch (Core.AlarmHistorian-007).

**Recommendation:** Add tests for: a corrupt payload row (insert raw bad JSON via a direct SQLite write, then drain and assert the correct row is dead-lettered and others are unaffected); a `FakeWriter` returning a wrong-length outcome list; a parallel enqueue/drain stress test; and the timer-driven `StartDrainLoop` path.

**Resolution:** Resolved 2026-05-22 — (a) `Drain_with_corrupt_payload_row_deadletters_it_and_keeps_good_rows_aligned` and `Drain_with_corrupt_head_row_does_not_stall_queue` cover corrupt payloads; (b) `StartDrainLoop_honors_backoff_and_slows_cadence_under_retry`, `StartDrainLoop_keeps_steady_cadence_when_writer_is_healthy`, and `StartDrainLoop_records_drain_fault_and_keeps_running` cover the timer-driven path; (c) `Concurrent_enqueue_and_drain_do_not_throw_sqlite_busy` covers the concurrent stress path; (d) `Writer_returning_wrong_cardinality_outcomes_sets_backing_off_and_keeps_rows` covers the cardinality-mismatch branch. Additionally `Capacity_eviction_increments_evicted_count_on_status` and `GetStatus_snapshot_is_consistent_under_concurrent_drain` cover -009 and -005 respectively.

### Core.AlarmHistorian-011

| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs:5-9,76`, `AlarmHistorianEvent.cs:20` |
| Status | Resolved |

**Description:** Several doc-comments reference the retired v1 architecture. The `IAlarmHistorianSink` summary says ingestion "routes through Galaxy.Host's pipe" and `IAlarmHistorianWriter` says "Stream G wires this to the Galaxy.Host IPC client", but `docs/AlarmTracking.md` and `CLAUDE.md` state the legacy `Galaxy.Host` project was retired in PR 7.2 and the write path is now the Wonderware historian sidecar (`WonderwareHistorianClient`). `AlarmHistorianEvent.cs:20` likewise says "the Galaxy.Host handler maps to the historian's enum on the wire." These stale references will mislead a reader about where the writer is actually hosted.

**Recommendation:** Update the doc-comments to refer to the Wonderware historian sidecar / `WonderwareHistorianClient` (`IAlarmHistorianWriter` implementation) instead of `Galaxy.Host`, consistent with `docs/AlarmTracking.md`'s "Historian write-back" section.

**Resolution:** Resolved 2026-05-23 — the three stale `Galaxy.Host` references were already replaced ahead of this resolution by earlier commits (`bdca772` rewrote the `IAlarmHistorianSink` summary + `IAlarmHistorianWriter` summary to name the Wonderware historian sidecar / `WonderwareHistorianClient`; `f6d487b` rewrote the `AlarmHistorianEvent.EventKind` doc-comment). A fresh grep across the project confirms no remaining `Galaxy.Host` / "Stream G wires this" strings — only the legitimate `Galaxy-native` alarm-source label survives. Status flipped to Resolved during the -008 pass; no new source change was needed.