review(Core.AlarmHistorian): reset drain state on cancel + volatile _disposed

Re-review at 7286d320. -012 (Medium): OperationCanceledException left _drainState stuck
at Draining on the status surface; now resets to BackingOff + test. -013: _disposed ->
volatile (mirrors _backoffIndex). -014 (post-dispose status guards) deferred cross-module.
This commit is contained in:
Joseph Doherty
2026-06-19 11:21:35 -04:00
parent 48af117bff
commit 6b4210cb17
3 changed files with 125 additions and 4 deletions
+69 -3
View File
@@ -4,10 +4,10 @@
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Review date | 2026-06-19 |
| Commit reviewed | `621d00e4` |
| Status | Reviewed |
| Open findings | 0 |
| Open findings | 1 |
## Checklist coverage
@@ -190,3 +190,69 @@
**Recommendation:** Update the doc-comments to refer to the Wonderware historian sidecar / `WonderwareHistorianClient` (`IAlarmHistorianWriter` implementation) instead of `Galaxy.Host`, consistent with `docs/AlarmTracking.md`'s "Historian write-back" section.
**Resolution:** Resolved 2026-05-23 — the three stale `Galaxy.Host` references were already replaced ahead of this resolution by earlier commits (`bdca772` rewrote the `IAlarmHistorianSink` summary + `IAlarmHistorianWriter` summary to name the Wonderware historian sidecar / `WonderwareHistorianClient`; `f6d487b` rewrote the `AlarmHistorianEvent.EventKind` doc-comment). A fresh grep across the project confirms no remaining `Galaxy.Host` / "Stream G wires this" strings — only the legitimate `Galaxy-native` alarm-source label survives. Status flipped to Resolved during the -008 pass; no new source change was needed.
---
## Re-review 2026-06-19 (commit 621d00e4)
All 11 prior findings confirmed resolved at HEAD. The re-review covered all 10 checklist categories against the 3 source files (874 LOC) and the single test file (820 LOC, 28 tests) at HEAD. Two new findings: one resolved in-session (Medium), one deferred (Low).
## Checklist coverage — re-review (621d00e4, 2026-06-19)
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core.AlarmHistorian-012 (Resolved) |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Core.AlarmHistorian-013 (Resolved) |
| 4 | Error handling & resilience | Core.AlarmHistorian-014 (Open-deferred) |
| 5 | Security | No issues found |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | No issues found |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | No issues found |
| 10 | Documentation & comments | No issues found |
### Core.AlarmHistorian-012
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:412-415` |
| Status | Resolved |
**Description:** When `WriteBatchAsync` throws `OperationCanceledException`, the catch at line 412415 re-throws without resetting `_drainState`. The drain tick set `_drainState = Draining` at the top of `DrainOnceAsync` (inside `lock (_statusLock)`), and the `OperationCanceledException` path exits the method without setting it back to `Idle` or `BackingOff`. The status surface therefore permanently shows `Draining` until the next drain tick begins — any Admin UI poll or health check in that window sees a stale, misleading state. In production the drain timer always passes `CancellationToken.None` so this is not triggerable via the timer path; however, any code that calls `DrainOnceAsync` directly with a real cancellable token (future test code, operator tooling, or a shutdown path) will hit this. The `_drainState` invariant — only `Draining` while the semaphore is held — is violated.
**Recommendation:** Before re-throwing, set `_drainState = BackingOff` under `_statusLock` so the status surface accurately reflects that the drain stopped without applying any outcomes.
**Resolution:** Resolved 2026-06-19 — added `lock (_statusLock) { _drainState = HistorianDrainState.BackingOff; }` in the `OperationCanceledException` catch block before re-throwing. Regression test `Drain_cancelled_mid_write_resets_drain_state_to_not_draining` added; confirmed fail-before / pass-after. All 28 tests green.
### Core.AlarmHistorian-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:79` |
| Status | Resolved |
**Description:** `_disposed` is a plain `bool` field read on multiple threads (`EnqueueAsync` on the emitting thread, `DrainOnceAsync` / `RescheduleDrain` on the timer thread, `StartDrainLoop` on the caller's thread) and written by `Dispose()` on the owner's thread — with no memory barrier. Per the C# language specification this is a data race. On the .NET/x64 JIT the weak read is unlikely to manifest as a bug in practice, but it is technically undefined behaviour: the compiler or JIT is permitted to cache the stale `false` value in a register across repeated checks. `_backoffIndex` was already correctly marked `volatile` for the same cross-thread-visibility reason (commit `c20d2283`); `_disposed` was not updated at the same time.
**Recommendation:** Mark `_disposed` as `volatile bool` so all reads see the most recently written value without requiring an additional lock or `Interlocked` operation.
**Resolution:** Resolved 2026-06-19 — `_disposed` changed to `private volatile bool _disposed` with an explanatory comment. No test added (the race is non-deterministic and the existing `Disposed_sink_rejects_enqueue` test already exercises the dispose path; a data-race test would require a memory-model fuzzer). All 28 tests green.
### Core.AlarmHistorian-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:515-555,558-568` |
| Status | Open |
**Description:** `GetStatus()` and `RetryDeadLettered()` have no `_disposed` guard. After `Dispose()` is called both methods still open new `SqliteConnection` objects against the database file (which persists on disk after disposal). This works because SQLite is file-backed, but the behavior is surprising and inconsistent with the convention that post-disposal calls on an `IDisposable` object throw `ObjectDisposedException`. `RetryDeadLettered()` in particular performs a write (`UPDATE Queue …`) against a sink that the owner has declared finished. If a lingering Admin UI polling loop calls `GetStatus()` after the host shuts down and disposes the sink, no error is surfaced — the method silently succeeds.
**Recommendation:** Add `if (_disposed) throw new ObjectDisposedException(nameof(SqliteStoreAndForwardSink));` guards at the top of `GetStatus()` and `RetryDeadLettered()`, consistent with the guards already present on `EnqueueAsync` and `StartDrainLoop`.
**Deferred:** The current behavior (silently succeeds post-disposal) is benign in production because the Admin UI is also shutting down when the host disposes the sink. Adding the guard is a one-line change per method but requires coordinating with the Admin UI callers to confirm they tolerate `ObjectDisposedException` after shutdown — that is a cross-module concern. Deferring until the Admin UI shutdown sequence is reviewed.
@@ -76,7 +76,9 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
private Timer? _drainTimer;
private TimeSpan _tickInterval;
private volatile int _backoffIndex;
private bool _disposed;
// volatile: read by EnqueueAsync, DrainOnceAsync, RescheduleDrain, and StartDrainLoop
// from different threads; Dispose() writes it from the owner's thread (Core.AlarmHistorian-013).
private volatile bool _disposed;
// Core.AlarmHistorian-005: status fields written by the drain timer thread and
// read concurrently by GetStatus() / health-check threads. Guard all reads and
@@ -411,6 +413,10 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
}
catch (OperationCanceledException)
{
// Core.AlarmHistorian-012: reset _drainState so the status surface does
// not stay stuck at Draining after the cancellation unwinds the method.
// The row stays queued; the next drain tick will retry it.
lock (_statusLock) { _drainState = HistorianDrainState.BackingOff; }
throw;
}
catch (Exception ex)
@@ -786,6 +786,55 @@ public sealed class SqliteStoreAndForwardSinkTests : IDisposable
sink.GetStatus().QueueDepth.ShouldBe(0, "every event drained at the end of the run");
}
/// <summary>
/// Regression for Core.AlarmHistorian-012: when WriteBatchAsync throws
/// <see cref="OperationCanceledException"/> the drain exits via re-throw
/// without resetting <c>_drainState</c>, leaving the status surface permanently
/// showing <c>Draining</c>. The next call to <see cref="SqliteStoreAndForwardSink.GetStatus"/>
/// must reflect <c>Idle</c> or <c>BackingOff</c>, not the stale <c>Draining</c>
/// state that was set at the top of the drain tick.
/// </summary>
[Fact]
public async Task Drain_cancelled_mid_write_resets_drain_state_to_not_draining()
{
var writer = new CancellableWriter();
using var sink = new SqliteStoreAndForwardSink(_dbPath, writer, _log);
await sink.EnqueueAsync(Event("A1"), CancellationToken.None);
using var cts = new CancellationTokenSource();
writer.CancelOn(cts);
// The drain will cancel inside WriteBatchAsync — OperationCanceledException re-throws.
await Should.ThrowAsync<OperationCanceledException>(
() => sink.DrainOnceAsync(cts.Token));
// DrainState must NOT be left as Draining after the cancellation escapes.
sink.GetStatus().DrainState.ShouldNotBe(HistorianDrainState.Draining,
"cancellation must clear the Draining status; row stays queued for next tick");
// The row must still be in the queue — the cancelled drain applied no outcomes.
sink.GetStatus().QueueDepth.ShouldBe(1, "row stays queued when drain is cancelled");
}
/// <summary>A writer that cancels its own CancellationTokenSource on the first call.</summary>
private sealed class CancellableWriter : IAlarmHistorianWriter
{
private CancellationTokenSource? _cts;
/// <summary>Register the token source to cancel on first write.</summary>
public void CancelOn(CancellationTokenSource cts) => _cts = cts;
/// <summary>Writes a batch; cancels via the registered CTS on the first call.</summary>
public Task<IReadOnlyList<HistorianWriteOutcome>> WriteBatchAsync(
IReadOnlyList<AlarmHistorianEvent> batch, CancellationToken ct)
{
_cts?.Cancel();
ct.ThrowIfCancellationRequested();
var outcomes = Enumerable.Repeat(HistorianWriteOutcome.Ack, batch.Count).ToList();
return Task.FromResult<IReadOnlyList<HistorianWriteOutcome>>(outcomes);
}
}
/// <summary>
/// Helper that confirms the queue depth surfaced by GetStatus matches a fresh
/// COUNT(*) read directly from storage — proves the in-memory counter has not