review(Core.AlarmHistorian): reset drain state on cancel + volatile _disposed

Re-review at 7286d320. -012 (Medium): OperationCanceledException left _drainState stuck at Draining on the status surface; now resets to BackingOff + test. -013: _disposed -> volatile (mirrors _backoffIndex). -014 (post-dispose status guards) deferred cross-module.
2026-06-19 11:21:35 -04:00
parent 48af117bff
commit 6b4210cb17
3 changed files with 125 additions and 4 deletions
@@ -4,10 +4,10 @@
 |---|---|
 | Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian` |
 | Reviewer | Claude Code |
-| Review date | 2026-05-22 |
-| Commit reviewed | `76d35d1` |
+| Review date | 2026-06-19 |
+| Commit reviewed | `621d00e4` |
 | Status | Reviewed |
-| Open findings | 0 |
+| Open findings | 1 |

 ## Checklist coverage

@@ -190,3 +190,69 @@
 **Recommendation:** Update the doc-comments to refer to the Wonderware historian sidecar / `WonderwareHistorianClient` (`IAlarmHistorianWriter` implementation) instead of `Galaxy.Host`, consistent with `docs/AlarmTracking.md`'s "Historian write-back" section.

 **Resolution:** Resolved 2026-05-23 — the three stale `Galaxy.Host` references were already replaced ahead of this resolution by earlier commits (`bdca772` rewrote the `IAlarmHistorianSink` summary + `IAlarmHistorianWriter` summary to name the Wonderware historian sidecar / `WonderwareHistorianClient`; `f6d487b` rewrote the `AlarmHistorianEvent.EventKind` doc-comment). A fresh grep across the project confirms no remaining `Galaxy.Host` / "Stream G wires this" strings — only the legitimate `Galaxy-native` alarm-source label survives. Status flipped to Resolved during the -008 pass; no new source change was needed.
+
+---
+
+## Re-review 2026-06-19 (commit 621d00e4)
+
+All 11 prior findings confirmed resolved at HEAD. The re-review covered all 10 checklist categories against the 3 source files (874 LOC) and the single test file (820 LOC, 28 tests) at HEAD. Two new findings: one resolved in-session (Medium), one deferred (Low).
+
+## Checklist coverage — re-review (621d00e4, 2026-06-19)
+
+| # | Category | Result |
+|---|---|---|
+| 1 | Correctness & logic bugs | Core.AlarmHistorian-012 (Resolved) |
+| 2 | OtOpcUa conventions | No issues found |
+| 3 | Concurrency & thread safety | Core.AlarmHistorian-013 (Resolved) |
+| 4 | Error handling & resilience | Core.AlarmHistorian-014 (Open-deferred) |
+| 5 | Security | No issues found |
+| 6 | Performance & resource management | No issues found |
+| 7 | Design-document adherence | No issues found |
+| 8 | Code organization & conventions | No issues found |
+| 9 | Testing coverage | No issues found |
+| 10 | Documentation & comments | No issues found |
+
+### Core.AlarmHistorian-012
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:412-415` |
+| Status | Resolved |
+
+**Description:** When `WriteBatchAsync` throws `OperationCanceledException`, the catch at line 412–415 re-throws without resetting `_drainState`. The drain tick set `_drainState = Draining` at the top of `DrainOnceAsync` (inside `lock (_statusLock)`), and the `OperationCanceledException` path exits the method without setting it back to `Idle` or `BackingOff`. The status surface therefore permanently shows `Draining` until the next drain tick begins — any Admin UI poll or health check in that window sees a stale, misleading state. In production the drain timer always passes `CancellationToken.None` so this is not triggerable via the timer path; however, any code that calls `DrainOnceAsync` directly with a real cancellable token (future test code, operator tooling, or a shutdown path) will hit this. The `_drainState` invariant — only `Draining` while the semaphore is held — is violated.
+
+**Recommendation:** Before re-throwing, set `_drainState = BackingOff` under `_statusLock` so the status surface accurately reflects that the drain stopped without applying any outcomes.
+
+**Resolution:** Resolved 2026-06-19 — added `lock (_statusLock) { _drainState = HistorianDrainState.BackingOff; }` in the `OperationCanceledException` catch block before re-throwing. Regression test `Drain_cancelled_mid_write_resets_drain_state_to_not_draining` added; confirmed fail-before / pass-after. All 28 tests green.
+
+### Core.AlarmHistorian-013
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:79` |
+| Status | Resolved |
+
+**Description:** `_disposed` is a plain `bool` field read on multiple threads (`EnqueueAsync` on the emitting thread, `DrainOnceAsync` / `RescheduleDrain` on the timer thread, `StartDrainLoop` on the caller's thread) and written by `Dispose()` on the owner's thread — with no memory barrier. Per the C# language specification this is a data race. On the .NET/x64 JIT the weak read is unlikely to manifest as a bug in practice, but it is technically undefined behaviour: the compiler or JIT is permitted to cache the stale `false` value in a register across repeated checks. `_backoffIndex` was already correctly marked `volatile` for the same cross-thread-visibility reason (commit `c20d2283`); `_disposed` was not updated at the same time.
+
+**Recommendation:** Mark `_disposed` as `volatile bool` so all reads see the most recently written value without requiring an additional lock or `Interlocked` operation.
+
+**Resolution:** Resolved 2026-06-19 — `_disposed` changed to `private volatile bool _disposed` with an explanatory comment. No test added (the race is non-deterministic and the existing `Disposed_sink_rejects_enqueue` test already exercises the dispose path; a data-race test would require a memory-model fuzzer). All 28 tests green.
+
+### Core.AlarmHistorian-014
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:515-555,558-568` |
+| Status | Open |
+
+**Description:** `GetStatus()` and `RetryDeadLettered()` have no `_disposed` guard. After `Dispose()` is called both methods still open new `SqliteConnection` objects against the database file (which persists on disk after disposal). This works because SQLite is file-backed, but the behavior is surprising and inconsistent with the convention that post-disposal calls on an `IDisposable` object throw `ObjectDisposedException`. `RetryDeadLettered()` in particular performs a write (`UPDATE Queue …`) against a sink that the owner has declared finished. If a lingering Admin UI polling loop calls `GetStatus()` after the host shuts down and disposes the sink, no error is surfaced — the method silently succeeds.
+
+**Recommendation:** Add `if (_disposed) throw new ObjectDisposedException(nameof(SqliteStoreAndForwardSink));` guards at the top of `GetStatus()` and `RetryDeadLettered()`, consistent with the guards already present on `EnqueueAsync` and `StartDrainLoop`.
+
+**Deferred:** The current behavior (silently succeeds post-disposal) is benign in production because the Admin UI is also shutting down when the host disposes the sink. Adding the guard is a one-line change per method but requires coordinating with the Admin UI callers to confirm they tolerate `ObjectDisposedException` after shutdown — that is a cross-module concern. Deferring until the Admin UI shutdown sequence is reviewed.
@@ -76,7 +76,9 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
    private Timer? _drainTimer;
    private TimeSpan _tickInterval;
    private volatile int _backoffIndex;
-    private bool _disposed;
+    // volatile: read by EnqueueAsync, DrainOnceAsync, RescheduleDrain, and StartDrainLoop
+    // from different threads; Dispose() writes it from the owner's thread (Core.AlarmHistorian-013).
+    private volatile bool _disposed;

    // Core.AlarmHistorian-005: status fields written by the drain timer thread and
    // read concurrently by GetStatus() / health-check threads. Guard all reads and
@@ -411,6 +413,10 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
            }
            catch (OperationCanceledException)
            {
+                // Core.AlarmHistorian-012: reset _drainState so the status surface does
+                // not stay stuck at Draining after the cancellation unwinds the method.
+                // The row stays queued; the next drain tick will retry it.
+                lock (_statusLock) { _drainState = HistorianDrainState.BackingOff; }
                throw;
            }
            catch (Exception ex)
@@ -786,6 +786,55 @@ public sealed class SqliteStoreAndForwardSinkTests : IDisposable
        sink.GetStatus().QueueDepth.ShouldBe(0, "every event drained at the end of the run");
    }

+    /// <summary>
+    ///     Regression for Core.AlarmHistorian-012: when WriteBatchAsync throws
+    ///     <see cref="OperationCanceledException"/> the drain exits via re-throw
+    ///     without resetting <c>_drainState</c>, leaving the status surface permanently
+    ///     showing <c>Draining</c>. The next call to <see cref="SqliteStoreAndForwardSink.GetStatus"/>
+    ///     must reflect <c>Idle</c> or <c>BackingOff</c>, not the stale <c>Draining</c>
+    ///     state that was set at the top of the drain tick.
+    /// </summary>
+    [Fact]
+    public async Task Drain_cancelled_mid_write_resets_drain_state_to_not_draining()
+    {
+        var writer = new CancellableWriter();
+        using var sink = new SqliteStoreAndForwardSink(_dbPath, writer, _log);
+        await sink.EnqueueAsync(Event("A1"), CancellationToken.None);
+
+        using var cts = new CancellationTokenSource();
+        writer.CancelOn(cts);
+
+        // The drain will cancel inside WriteBatchAsync — OperationCanceledException re-throws.
+        await Should.ThrowAsync<OperationCanceledException>(
+            () => sink.DrainOnceAsync(cts.Token));
+
+        // DrainState must NOT be left as Draining after the cancellation escapes.
+        sink.GetStatus().DrainState.ShouldNotBe(HistorianDrainState.Draining,
+            "cancellation must clear the Draining status; row stays queued for next tick");
+
+        // The row must still be in the queue — the cancelled drain applied no outcomes.
+        sink.GetStatus().QueueDepth.ShouldBe(1, "row stays queued when drain is cancelled");
+    }
+
+    /// <summary>A writer that cancels its own CancellationTokenSource on the first call.</summary>
+    private sealed class CancellableWriter : IAlarmHistorianWriter
+    {
+        private CancellationTokenSource? _cts;
+
+        /// <summary>Register the token source to cancel on first write.</summary>
+        public void CancelOn(CancellationTokenSource cts) => _cts = cts;
+
+        /// <summary>Writes a batch; cancels via the registered CTS on the first call.</summary>
+        public Task<IReadOnlyList<HistorianWriteOutcome>> WriteBatchAsync(
+            IReadOnlyList<AlarmHistorianEvent> batch, CancellationToken ct)
+        {
+            _cts?.Cancel();
+            ct.ThrowIfCancellationRequested();
+            var outcomes = Enumerable.Repeat(HistorianWriteOutcome.Ack, batch.Count).ToList();
+            return Task.FromResult<IReadOnlyList<HistorianWriteOutcome>>(outcomes);
+        }
+    }
+
    /// <summary>
    ///     Helper that confirms the queue depth surfaced by GetStatus matches a fresh
    ///     COUNT(*) read directly from storage — proves the in-memory counter has not