Files
lmxopcua/code-reviews/Core.AlarmHistorian/findings.md
T
Joseph Doherty 6b4210cb17 review(Core.AlarmHistorian): reset drain state on cancel + volatile _disposed
Re-review at 7286d320. -012 (Medium): OperationCanceledException left _drainState stuck
at Draining on the status surface; now resets to BackingOff + test. -013: _disposed ->
volatile (mirrors _backoffIndex). -014 (post-dispose status guards) deferred cross-module.
2026-06-19 11:21:35 -04:00

259 lines
27 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Code Review — Core.AlarmHistorian
| Field | Value |
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian` |
| Reviewer | Claude Code |
| Review date | 2026-06-19 |
| Commit reviewed | `621d00e4` |
| Status | Reviewed |
| Open findings | 1 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core.AlarmHistorian-001, Core.AlarmHistorian-002 |
| 2 | OtOpcUa conventions | Core.AlarmHistorian-003 |
| 3 | Concurrency & thread safety | Core.AlarmHistorian-004, Core.AlarmHistorian-005 |
| 4 | Error handling & resilience | Core.AlarmHistorian-006, Core.AlarmHistorian-007 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Core.AlarmHistorian-008 |
| 7 | Design-document adherence | Core.AlarmHistorian-009 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Core.AlarmHistorian-010 |
| 10 | Documentation & comments | Core.AlarmHistorian-011 |
## Findings
### Core.AlarmHistorian-001
| Field | Value |
|---|---|
| Severity | Critical |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:255-278` |
| Status | Resolved |
**Description:** `ReadBatch` builds two parallel lists, `rowIds` and `events`, that `DrainOnceAsync` later indexes together (`rowIds[i]` paired with `outcomes[i]`, where `outcomes` is 1:1 with `events`). But `rowIds.Add(reader.GetInt64(0))` runs unconditionally for every row, while `events.Add(evt)` is guarded by `if (evt is not null)`. If `JsonSerializer.Deserialize<AlarmHistorianEvent>` returns `null` for any row (corrupt or empty payload), `rowIds` gains an entry but `events` does not. The writer then returns `outcomes.Count == events.Count`, which passes the `outcomes.Count != events.Count` guard, and the per-row loop applies each outcome to `rowIds[i]` — every row from the skipped index onward is mapped to the wrong event's outcome. An `Ack` can delete a row whose event was never sent to the historian (silent alarm-event data loss), and a `PermanentFail` can dead-letter an unrelated good row. The corrupt row itself is never advanced and is re-read on every drain forever, permanently stalling the queue head.
**Recommendation:** Keep `rowIds` and `events` strictly aligned. Either skip the `rowId` when deserialization returns `null`, or — better — treat a `null`/failed deserialization as an immediate dead-letter for that specific `RowId` (it can never succeed) and exclude it from the batch passed to the writer. Carry the `rowId` inside a single list of `(long RowId, AlarmHistorianEvent Event)` tuples so the two can never drift.
**Resolution:** Resolved 2026-05-22 — `ReadBatch` now returns a single list of `QueueRow(long RowId, AlarmHistorianEvent? Event)` records so a rowId can never drift from its event; `DrainOnceAsync` immediately dead-letters rows whose payload is null/un-deserializable (also catching `JsonException`) and forwards only well-formed events to the writer, mapping outcomes by `liveRows[i].RowId`. Regression tests `Drain_with_corrupt_payload_row_deadletters_it_and_keeps_good_rows_aligned` and `Drain_with_corrupt_head_row_does_not_stall_queue` added.
### Core.AlarmHistorian-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:99-105,386-388` |
| Status | Resolved |
**Description:** The class computes an exponential-backoff value (`_backoffIndex`, `BumpBackoff`, `CurrentBackoff`, the `BackoffLadder`) and the class doc-comment states "Drain runs on a shared `Timer`. Exponential backoff on `RetryPlease`: 1s → 2s → 5s → 15s → 60s cap." However `StartDrainLoop` creates the `Timer` with a fixed `tickInterval` for both due-time and period and never reschedules it. `CurrentBackoff` is computed but never consulted by the timer, so the drain loop keeps hammering the historian at the fixed cadence regardless of `BackingOff` state. The documented backoff behavior does not exist for the production drain path — it is only observable via the `CurrentBackoff` property in tests.
**Recommendation:** Make the drain loop honor the backoff. Either switch to a self-rescheduling one-shot timer that sets its next due-time to `max(tickInterval, CurrentBackoff)` after each `DrainOnceAsync`, or have `DrainOnceAsync` skip the writer call while still inside the backoff window (track `_nextEligibleDrainUtc`). Update the doc-comment if the design intentionally changes.
**Resolution:** Resolved 2026-05-22 — `StartDrainLoop` now arms a self-rescheduling one-shot `Timer`; `RescheduleDrain` sets the next due-time to `max(tickInterval, CurrentBackoff)` while `_drainState` is `BackingOff` so a historian outage genuinely slows the cadence down the ladder. Class doc-comment updated. Regression tests `StartDrainLoop_honors_backoff_and_slows_cadence_under_retry` and `StartDrainLoop_keeps_steady_cadence_when_writer_is_healthy` added.
### Core.AlarmHistorian-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | OtOpcUa conventions |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:107-127,218-243,246-253` |
| Status | Resolved |
**Description:** `EnqueueAsync` is declared `async`-shaped (`Task EnqueueAsync(...)`) and the `IAlarmHistorianSink` contract explicitly states "the sink MUST NOT block the emitting thread … `EnqueueAsync` returns as soon as the queue row is committed." But the implementation does fully synchronous, blocking SQLite I/O (`conn.Open()`, `EnforceCapacity`, `cmd.ExecuteNonQuery()`) on the caller's thread and only then returns `Task.CompletedTask`. Under SQLite write contention with the drain worker this blocks the alarm-emitting thread for the full lock-wait. The same synchronous-work-behind-an-async-or-status-API pattern applies to `GetStatus` (called from the Admin UI / `/healthz` request thread) and `RetryDeadLettered`. The `cancellationToken` parameter of `EnqueueAsync` is accepted and ignored.
**Recommendation:** Either make the I/O genuinely asynchronous (`await conn.OpenAsync(ct)`, `await cmd.ExecuteNonQueryAsync(ct)``Microsoft.Data.Sqlite` supports the async surface), or change `EnqueueAsync` to an in-memory hand-off (e.g. a `Channel`) drained by a background writer so the emitting thread truly never touches the database. At minimum honor the `cancellationToken` parameter.
**Resolution:** Resolved 2026-05-22 — `EnqueueAsync` now uses `OpenAsync` / `ExecuteNonQueryAsync` / `ExecuteScalarAsync` throughout (capacity check included); `ApplyPragmasAsync` handles the WAL/busy-timeout PRAGMA on the async path; `cancellationToken` is threaded through every await so cancellation is honoured.
### Core.AlarmHistorian-004
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:90,112,176,259` |
| Status | Resolved |
**Description:** Every operation opens a brand-new `SqliteConnection` from the bare connection string `Data Source={databasePath}` — no `busy_timeout` / `Pragma`, no shared cache. SQLite serializes writers with a file lock; when `EnqueueAsync` (emitting thread) and `DrainOnceAsync` (timer thread) collide, the loser gets an immediate `SQLITE_BUSY` exception because the default busy timeout is 0. In `DrainOnceAsync` the `BeginTransaction()` / `Commit()` block can fail mid-drain with `SQLITE_BUSY`; the exception escapes the `try` (it is not the writer-call `try`), the `finally` releases the gate, and the row outcomes are lost / partially applied. The class doc-comment claims `DrainOnceAsync` is "Safe to call from multiple threads" but the concurrent enqueue-vs-drain case is not actually safe against busy errors.
**Recommendation:** Configure a non-zero busy timeout — `SqliteConnectionStringBuilder { DataSource = databasePath, DefaultTimeout = 5 }` plus `PRAGMA busy_timeout=5000` on open. Strongly consider WAL journal mode (`PRAGMA journal_mode=WAL`) so readers and the writer do not block each other. Reuse a single long-lived write connection guarded by `_drainGate` rather than opening/closing per call.
**Resolution:** Resolved 2026-05-22 — the connection string is now built via `SqliteConnectionStringBuilder` with `DefaultTimeout = 5`, and every connection is opened through a new `OpenConnection` helper that applies `PRAGMA busy_timeout=5000` and `PRAGMA journal_mode=WAL` so an enqueue/drain lock collision waits the lock out instead of throwing `SQLITE_BUSY`. All eight call sites switched to the helper. Regression test `Concurrent_enqueue_and_drain_do_not_throw_sqlite_busy` added.
### Core.AlarmHistorian-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:66-71,141-143,199,386-388` |
| Status | Resolved |
**Description:** The mutable status fields `_lastDrainUtc`, `_lastSuccessUtc`, `_lastError`, `_drainState`, and `_backoffIndex` are written by the drain timer thread inside `DrainOnceAsync` and read concurrently by `GetStatus()` / `CurrentBackoff` on Admin-UI / health-check threads with no memory barrier (no `lock`, no `volatile`, no `Interlocked`). `DateTime?` is not guaranteed to be written atomically, and the reader can observe a stale or torn value. This is a diagnostics surface so the impact is limited, but a torn `DateTime?` read is real undefined behavior.
**Recommendation:** Guard the status fields with a small lock, or make the scalars `volatile` where the type permits and snapshot `DateTime?` values under a lock. Take the snapshot atomically in `GetStatus()`.
**Resolution:** Resolved 2026-05-22 — added `_statusLock` object; all writes to `_lastDrainUtc`, `_lastSuccessUtc`, `_lastError`, `_drainState`, and `_evictedCount` (new) now happen inside `lock (_statusLock)` blocks; `GetStatus()` snapshots all fields atomically under the same lock. Regression test `GetStatus_snapshot_is_consistent_under_concurrent_drain` added.
### Core.AlarmHistorian-006
| Field | Value |
|---|---|
| Severity | High |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:103,135-216` |
| Status | Resolved |
**Description:** `StartDrainLoop` launches the drain with `new Timer(_ => _ = DrainOnceAsync(CancellationToken.None), ...)`. The returned `Task` is discarded (`_ =`), so any exception thrown by `DrainOnceAsync` is an unobserved task exception — never logged, never surfaced. Several paths in `DrainOnceAsync` can throw: the `outcomes.Count != events.Count` guard (`InvalidOperationException`), `JsonSerializer.Deserialize` on a malformed payload, `PurgeAgedDeadLetters` / `ReadBatch` / the commit block hitting `SQLITE_BUSY` or a schema error. When any of these throw, the drain silently stops making progress for that tick, `_drainState` is left stale (still `Draining`), and an operator watching the Admin UI sees no error. A persistently failing condition produces a silent, permanently stalled queue.
**Recommendation:** Wrap the timer callback body in a `try/catch` that logs the exception via `_logger.Error`, records it into `_lastError`, and resets `_drainState` so the diagnostics surface reflects the failure. Do not discard the `Task` without an attached continuation that observes faults.
**Resolution:** Resolved 2026-05-22 — the timer no longer discards the drain `Task`. A dedicated `DrainTimerCallback` `await`s `DrainOnceAsync` inside a `try/catch` that logs the fault via `_logger.Error`, records it into `_lastError`, and sets `_drainState = BackingOff` so the failure is visible on the `GetStatus` surface; a `finally` always re-arms the timer so a faulting tick can never permanently stall the queue. Regression test `StartDrainLoop_records_drain_fault_and_keeps_running` added.
### Core.AlarmHistorian-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:172-174` |
| Status | Resolved |
**Description:** When the writer returns a wrong-cardinality result, the code throws `InvalidOperationException` after `WriteBatchAsync` has already succeeded. The events were potentially delivered to the historian, but no rows are deleted or dead-lettered, `_drainState` is left at `Draining`, and the backoff is not bumped. Combined with Core.AlarmHistorian-006 the exception is then swallowed. On the next drain the same batch is re-sent — if the writer actually delivered the events the first time, this produces duplicate historian rows; if it is a deterministic writer bug the queue stalls forever.
**Recommendation:** Treat a cardinality mismatch as a transient batch failure: log it, set `_lastError`, bump backoff, set `_drainState = BackingOff`, and return without throwing — mirroring the writer-exception path at lines 162-170. A deterministic writer contract violation should also raise an operator-visible alert rather than silently looping.
**Resolution:** Resolved 2026-05-22 — the `throw InvalidOperationException` replaced with log-and-backoff: mismatch is recorded into `_lastError`, `_drainState = BackingOff`, backoff is bumped, and the method returns without applying any outcomes — rows stay queued for the next drain attempt. Regression test `Writer_returning_wrong_cardinality_outcomes_sets_backing_off_and_keeps_rows` added.
### Core.AlarmHistorian-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:107-127,255-278` |
| Status | Resolved |
**Description:** Each `EnqueueAsync` (one per alarm transition — a hot path on a busy plant) opens a connection, runs `EnforceCapacity` (a `COUNT(*)` over the queue table on every single enqueue), serializes JSON, inserts, and closes the connection. The unconditional `COUNT(*)` on every enqueue is an avoidable scan; the open/close churn defeats connection pooling benefits and adds lock-acquisition overhead per event. `DrainOnceAsync` similarly opens three separate connections per tick (`PurgeAgedDeadLetters`, `ReadBatch`, the transaction block).
**Recommendation:** Reuse a single pooled write connection. Replace the per-enqueue `COUNT(*)` with a periodic capacity check (every Nth enqueue, or piggy-backed on the drain tick), or maintain an in-memory approximate counter. Combine the drain-tick connections into one.
**Resolution:** Resolved 2026-05-23 — added an `Interlocked`-guarded in-memory `_queuedRowCount` seeded from storage at construction and kept current by every mutation (enqueue increment, drain Ack/PermanentFail/corrupt-dead-letter decrements, capacity-eviction adjustment, RetryDeadLettered re-add). `EnqueueAsync` now short-circuits capacity enforcement against the cached counter via `EnforceCapacityFastPathAsync`, only paying for a real `COUNT(*)` when the cached value reaches the capacity wall or the periodic resync interval (every 10,000 enqueues) elapses; the obsolete sync `EnforceCapacity` was removed. `GetStatus()` reads `QueueDepth` from the same counter so a busy Admin UI no longer hits the DB for it. `DrainOnceAsync` is consolidated onto one shared `SqliteConnection` per tick — purge, read, corrupt-dead-letter, and the outcome-applying transaction now reuse it instead of opening three. Regression tests `EnqueueAsync_does_not_count_all_rows_on_every_call_below_capacity`, `Enqueue_and_drain_keep_queue_depth_consistent_with_storage`, and `Counter_remains_consistent_under_concurrent_enqueue_and_drain` added.
### Core.AlarmHistorian-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Design-document adherence |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:317-347` |
| Status | Resolved |
**Description:** `docs/AlarmTracking.md` and the `IAlarmHistorianSink` contract present the SQLite queue as the durability guarantee — "Durably enqueue the event", "operator acks never block on the historian being reachable". But `EnforceCapacity` silently deletes the oldest non-dead-lettered (not-yet-sent) rows when the queue reaches `DefaultCapacity` (1,000,000). Those are alarm-event records that were accepted as durably queued and are then dropped before ever reaching the historian — silent alarm-history data loss under sustained historian outage. The only signal is a `WARN` log line. Neither `docs/AlarmTracking.md` nor the sink's XML doc mentions that the durability guarantee is bounded, and there is no metric/dead-letter trail for evicted rows.
**Recommendation:** At minimum document the bounded-durability behavior in `docs/AlarmTracking.md` and the `IAlarmHistorianSink` summary. Better: surface evicted-row counts in `HistorianSinkStatus` (a dedicated counter) so the loss is operator-visible, and consider routing overflow to the dead-letter table instead of hard-deleting it so the records survive for post-mortem within the retention window.
**Resolution:** Resolved 2026-05-22 — added `EvictedCount` (default 0) to `HistorianSinkStatus` with full param-tag documentation; `EnforceCapacity` and `EnforceCapacityAsync` now increment `_evictedCount` (guarded by `_statusLock`) and include the lifetime total in the WARN log; `docs/AlarmTracking.md` documents the bounded-durability caveat and the `EvictedCount` surface. Regression test `Capacity_eviction_increments_evicted_count_on_status` added.
### Core.AlarmHistorian-010
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian.Tests/SqliteStoreAndForwardSinkTests.cs` |
| Status | Resolved |
**Description:** The test suite covers the happy paths well (Ack/Retry/PermanentFail, capacity eviction, retention purge, ctor validation) but leaves critical paths untested: (a) no test exercises a corrupt / `null`-deserializing `PayloadJson` row, so the `rowIds`/`events` misalignment bug (Core.AlarmHistorian-001) was not caught; (b) no test for `StartDrainLoop` actually running on the timer, nor for the backoff being honored by the schedule (Core.AlarmHistorian-002); (c) no concurrency test running `EnqueueAsync` and `DrainOnceAsync` in parallel, which is the exact scenario that triggers `SQLITE_BUSY` (Core.AlarmHistorian-004); (d) no test for the `outcomes.Count != events.Count` cardinality-mismatch branch (Core.AlarmHistorian-007).
**Recommendation:** Add tests for: a corrupt payload row (insert raw bad JSON via a direct SQLite write, then drain and assert the correct row is dead-lettered and others are unaffected); a `FakeWriter` returning a wrong-length outcome list; a parallel enqueue/drain stress test; and the timer-driven `StartDrainLoop` path.
**Resolution:** Resolved 2026-05-22 — (a) `Drain_with_corrupt_payload_row_deadletters_it_and_keeps_good_rows_aligned` and `Drain_with_corrupt_head_row_does_not_stall_queue` cover corrupt payloads; (b) `StartDrainLoop_honors_backoff_and_slows_cadence_under_retry`, `StartDrainLoop_keeps_steady_cadence_when_writer_is_healthy`, and `StartDrainLoop_records_drain_fault_and_keeps_running` cover the timer-driven path; (c) `Concurrent_enqueue_and_drain_do_not_throw_sqlite_busy` covers the concurrent stress path; (d) `Writer_returning_wrong_cardinality_outcomes_sets_backing_off_and_keeps_rows` covers the cardinality-mismatch branch. Additionally `Capacity_eviction_increments_evicted_count_on_status` and `GetStatus_snapshot_is_consistent_under_concurrent_drain` cover -009 and -005 respectively.
### Core.AlarmHistorian-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs:5-9,76`, `AlarmHistorianEvent.cs:20` |
| Status | Resolved |
**Description:** Several doc-comments reference the retired v1 architecture. The `IAlarmHistorianSink` summary says ingestion "routes through Galaxy.Host's pipe" and `IAlarmHistorianWriter` says "Stream G wires this to the Galaxy.Host IPC client", but `docs/AlarmTracking.md` and `CLAUDE.md` state the legacy `Galaxy.Host` project was retired in PR 7.2 and the write path is now the Wonderware historian sidecar (`WonderwareHistorianClient`). `AlarmHistorianEvent.cs:20` likewise says "the Galaxy.Host handler maps to the historian's enum on the wire." These stale references will mislead a reader about where the writer is actually hosted.
**Recommendation:** Update the doc-comments to refer to the Wonderware historian sidecar / `WonderwareHistorianClient` (`IAlarmHistorianWriter` implementation) instead of `Galaxy.Host`, consistent with `docs/AlarmTracking.md`'s "Historian write-back" section.
**Resolution:** Resolved 2026-05-23 — the three stale `Galaxy.Host` references were already replaced ahead of this resolution by earlier commits (`bdca772` rewrote the `IAlarmHistorianSink` summary + `IAlarmHistorianWriter` summary to name the Wonderware historian sidecar / `WonderwareHistorianClient`; `f6d487b` rewrote the `AlarmHistorianEvent.EventKind` doc-comment). A fresh grep across the project confirms no remaining `Galaxy.Host` / "Stream G wires this" strings — only the legitimate `Galaxy-native` alarm-source label survives. Status flipped to Resolved during the -008 pass; no new source change was needed.
---
## Re-review 2026-06-19 (commit 621d00e4)
All 11 prior findings confirmed resolved at HEAD. The re-review covered all 10 checklist categories against the 3 source files (874 LOC) and the single test file (820 LOC, 28 tests) at HEAD. Two new findings: one resolved in-session (Medium), one deferred (Low).
## Checklist coverage — re-review (621d00e4, 2026-06-19)
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core.AlarmHistorian-012 (Resolved) |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Core.AlarmHistorian-013 (Resolved) |
| 4 | Error handling & resilience | Core.AlarmHistorian-014 (Open-deferred) |
| 5 | Security | No issues found |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | No issues found |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | No issues found |
| 10 | Documentation & comments | No issues found |
### Core.AlarmHistorian-012
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:412-415` |
| Status | Resolved |
**Description:** When `WriteBatchAsync` throws `OperationCanceledException`, the catch at line 412415 re-throws without resetting `_drainState`. The drain tick set `_drainState = Draining` at the top of `DrainOnceAsync` (inside `lock (_statusLock)`), and the `OperationCanceledException` path exits the method without setting it back to `Idle` or `BackingOff`. The status surface therefore permanently shows `Draining` until the next drain tick begins — any Admin UI poll or health check in that window sees a stale, misleading state. In production the drain timer always passes `CancellationToken.None` so this is not triggerable via the timer path; however, any code that calls `DrainOnceAsync` directly with a real cancellable token (future test code, operator tooling, or a shutdown path) will hit this. The `_drainState` invariant — only `Draining` while the semaphore is held — is violated.
**Recommendation:** Before re-throwing, set `_drainState = BackingOff` under `_statusLock` so the status surface accurately reflects that the drain stopped without applying any outcomes.
**Resolution:** Resolved 2026-06-19 — added `lock (_statusLock) { _drainState = HistorianDrainState.BackingOff; }` in the `OperationCanceledException` catch block before re-throwing. Regression test `Drain_cancelled_mid_write_resets_drain_state_to_not_draining` added; confirmed fail-before / pass-after. All 28 tests green.
### Core.AlarmHistorian-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:79` |
| Status | Resolved |
**Description:** `_disposed` is a plain `bool` field read on multiple threads (`EnqueueAsync` on the emitting thread, `DrainOnceAsync` / `RescheduleDrain` on the timer thread, `StartDrainLoop` on the caller's thread) and written by `Dispose()` on the owner's thread — with no memory barrier. Per the C# language specification this is a data race. On the .NET/x64 JIT the weak read is unlikely to manifest as a bug in practice, but it is technically undefined behaviour: the compiler or JIT is permitted to cache the stale `false` value in a register across repeated checks. `_backoffIndex` was already correctly marked `volatile` for the same cross-thread-visibility reason (commit `c20d2283`); `_disposed` was not updated at the same time.
**Recommendation:** Mark `_disposed` as `volatile bool` so all reads see the most recently written value without requiring an additional lock or `Interlocked` operation.
**Resolution:** Resolved 2026-06-19 — `_disposed` changed to `private volatile bool _disposed` with an explanatory comment. No test added (the race is non-deterministic and the existing `Disposed_sink_rejects_enqueue` test already exercises the dispose path; a data-race test would require a memory-model fuzzer). All 28 tests green.
### Core.AlarmHistorian-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:515-555,558-568` |
| Status | Open |
**Description:** `GetStatus()` and `RetryDeadLettered()` have no `_disposed` guard. After `Dispose()` is called both methods still open new `SqliteConnection` objects against the database file (which persists on disk after disposal). This works because SQLite is file-backed, but the behavior is surprising and inconsistent with the convention that post-disposal calls on an `IDisposable` object throw `ObjectDisposedException`. `RetryDeadLettered()` in particular performs a write (`UPDATE Queue …`) against a sink that the owner has declared finished. If a lingering Admin UI polling loop calls `GetStatus()` after the host shuts down and disposes the sink, no error is surfaced — the method silently succeeds.
**Recommendation:** Add `if (_disposed) throw new ObjectDisposedException(nameof(SqliteStoreAndForwardSink));` guards at the top of `GetStatus()` and `RetryDeadLettered()`, consistent with the guards already present on `EnqueueAsync` and `StartDrainLoop`.
**Deferred:** The current behavior (silently succeeds post-disposal) is benign in production because the Admin UI is also shutting down when the host disposes the sink. Adding the guard is a one-line change per method but requires coordinating with the Admin UI callers to confirm they tolerate `ObjectDisposedException` after shutdown — that is a cross-module concern. Deferring until the Admin UI shutdown sequence is reviewed.