review(Core.AlarmHistorian): reset drain state on cancel + volatile _disposed
Re-review at 7286d320. -012 (Medium): OperationCanceledException left _drainState stuck
at Draining on the status surface; now resets to BackingOff + test. -013: _disposed ->
volatile (mirrors _backoffIndex). -014 (post-dispose status guards) deferred cross-module.
This commit is contained in:
@@ -4,10 +4,10 @@
|
||||
|---|---|
|
||||
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian` |
|
||||
| Reviewer | Claude Code |
|
||||
| Review date | 2026-05-22 |
|
||||
| Commit reviewed | `76d35d1` |
|
||||
| Review date | 2026-06-19 |
|
||||
| Commit reviewed | `621d00e4` |
|
||||
| Status | Reviewed |
|
||||
| Open findings | 0 |
|
||||
| Open findings | 1 |
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
@@ -190,3 +190,69 @@
|
||||
**Recommendation:** Update the doc-comments to refer to the Wonderware historian sidecar / `WonderwareHistorianClient` (`IAlarmHistorianWriter` implementation) instead of `Galaxy.Host`, consistent with `docs/AlarmTracking.md`'s "Historian write-back" section.
|
||||
|
||||
**Resolution:** Resolved 2026-05-23 — the three stale `Galaxy.Host` references were already replaced ahead of this resolution by earlier commits (`bdca772` rewrote the `IAlarmHistorianSink` summary + `IAlarmHistorianWriter` summary to name the Wonderware historian sidecar / `WonderwareHistorianClient`; `f6d487b` rewrote the `AlarmHistorianEvent.EventKind` doc-comment). A fresh grep across the project confirms no remaining `Galaxy.Host` / "Stream G wires this" strings — only the legitimate `Galaxy-native` alarm-source label survives. Status flipped to Resolved during the -008 pass; no new source change was needed.
|
||||
|
||||
---
|
||||
|
||||
## Re-review 2026-06-19 (commit 621d00e4)
|
||||
|
||||
All 11 prior findings confirmed resolved at HEAD. The re-review covered all 10 checklist categories against the 3 source files (874 LOC) and the single test file (820 LOC, 28 tests) at HEAD. Two new findings: one resolved in-session (Medium), one deferred (Low).
|
||||
|
||||
## Checklist coverage — re-review (621d00e4, 2026-06-19)
|
||||
|
||||
| # | Category | Result |
|
||||
|---|---|---|
|
||||
| 1 | Correctness & logic bugs | Core.AlarmHistorian-012 (Resolved) |
|
||||
| 2 | OtOpcUa conventions | No issues found |
|
||||
| 3 | Concurrency & thread safety | Core.AlarmHistorian-013 (Resolved) |
|
||||
| 4 | Error handling & resilience | Core.AlarmHistorian-014 (Open-deferred) |
|
||||
| 5 | Security | No issues found |
|
||||
| 6 | Performance & resource management | No issues found |
|
||||
| 7 | Design-document adherence | No issues found |
|
||||
| 8 | Code organization & conventions | No issues found |
|
||||
| 9 | Testing coverage | No issues found |
|
||||
| 10 | Documentation & comments | No issues found |
|
||||
|
||||
### Core.AlarmHistorian-012
|
||||
|
||||
| Field | Value |
|
||||
|---|---|
|
||||
| Severity | Medium |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:412-415` |
|
||||
| Status | Resolved |
|
||||
|
||||
**Description:** When `WriteBatchAsync` throws `OperationCanceledException`, the catch at line 412–415 re-throws without resetting `_drainState`. The drain tick set `_drainState = Draining` at the top of `DrainOnceAsync` (inside `lock (_statusLock)`), and the `OperationCanceledException` path exits the method without setting it back to `Idle` or `BackingOff`. The status surface therefore permanently shows `Draining` until the next drain tick begins — any Admin UI poll or health check in that window sees a stale, misleading state. In production the drain timer always passes `CancellationToken.None` so this is not triggerable via the timer path; however, any code that calls `DrainOnceAsync` directly with a real cancellable token (future test code, operator tooling, or a shutdown path) will hit this. The `_drainState` invariant — only `Draining` while the semaphore is held — is violated.
|
||||
|
||||
**Recommendation:** Before re-throwing, set `_drainState = BackingOff` under `_statusLock` so the status surface accurately reflects that the drain stopped without applying any outcomes.
|
||||
|
||||
**Resolution:** Resolved 2026-06-19 — added `lock (_statusLock) { _drainState = HistorianDrainState.BackingOff; }` in the `OperationCanceledException` catch block before re-throwing. Regression test `Drain_cancelled_mid_write_resets_drain_state_to_not_draining` added; confirmed fail-before / pass-after. All 28 tests green.
|
||||
|
||||
### Core.AlarmHistorian-013
|
||||
|
||||
| Field | Value |
|
||||
|---|---|
|
||||
| Severity | Low |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:79` |
|
||||
| Status | Resolved |
|
||||
|
||||
**Description:** `_disposed` is a plain `bool` field read on multiple threads (`EnqueueAsync` on the emitting thread, `DrainOnceAsync` / `RescheduleDrain` on the timer thread, `StartDrainLoop` on the caller's thread) and written by `Dispose()` on the owner's thread — with no memory barrier. Per the C# language specification this is a data race. On the .NET/x64 JIT the weak read is unlikely to manifest as a bug in practice, but it is technically undefined behaviour: the compiler or JIT is permitted to cache the stale `false` value in a register across repeated checks. `_backoffIndex` was already correctly marked `volatile` for the same cross-thread-visibility reason (commit `c20d2283`); `_disposed` was not updated at the same time.
|
||||
|
||||
**Recommendation:** Mark `_disposed` as `volatile bool` so all reads see the most recently written value without requiring an additional lock or `Interlocked` operation.
|
||||
|
||||
**Resolution:** Resolved 2026-06-19 — `_disposed` changed to `private volatile bool _disposed` with an explanatory comment. No test added (the race is non-deterministic and the existing `Disposed_sink_rejects_enqueue` test already exercises the dispose path; a data-race test would require a memory-model fuzzer). All 28 tests green.
|
||||
|
||||
### Core.AlarmHistorian-014
|
||||
|
||||
| Field | Value |
|
||||
|---|---|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:515-555,558-568` |
|
||||
| Status | Open |
|
||||
|
||||
**Description:** `GetStatus()` and `RetryDeadLettered()` have no `_disposed` guard. After `Dispose()` is called both methods still open new `SqliteConnection` objects against the database file (which persists on disk after disposal). This works because SQLite is file-backed, but the behavior is surprising and inconsistent with the convention that post-disposal calls on an `IDisposable` object throw `ObjectDisposedException`. `RetryDeadLettered()` in particular performs a write (`UPDATE Queue …`) against a sink that the owner has declared finished. If a lingering Admin UI polling loop calls `GetStatus()` after the host shuts down and disposes the sink, no error is surfaced — the method silently succeeds.
|
||||
|
||||
**Recommendation:** Add `if (_disposed) throw new ObjectDisposedException(nameof(SqliteStoreAndForwardSink));` guards at the top of `GetStatus()` and `RetryDeadLettered()`, consistent with the guards already present on `EnqueueAsync` and `StartDrainLoop`.
|
||||
|
||||
**Deferred:** The current behavior (silently succeeds post-disposal) is benign in production because the Admin UI is also shutting down when the host disposes the sink. Adding the guard is a one-line change per method but requires coordinating with the Admin UI callers to confirm they tolerate `ObjectDisposedException` after shutdown — that is a cross-module concern. Deferring until the Admin UI shutdown sequence is reviewed.
|
||||
|
||||
@@ -76,7 +76,9 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
|
||||
private Timer? _drainTimer;
|
||||
private TimeSpan _tickInterval;
|
||||
private volatile int _backoffIndex;
|
||||
private bool _disposed;
|
||||
// volatile: read by EnqueueAsync, DrainOnceAsync, RescheduleDrain, and StartDrainLoop
|
||||
// from different threads; Dispose() writes it from the owner's thread (Core.AlarmHistorian-013).
|
||||
private volatile bool _disposed;
|
||||
|
||||
// Core.AlarmHistorian-005: status fields written by the drain timer thread and
|
||||
// read concurrently by GetStatus() / health-check threads. Guard all reads and
|
||||
@@ -411,6 +413,10 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
|
||||
}
|
||||
catch (OperationCanceledException)
|
||||
{
|
||||
// Core.AlarmHistorian-012: reset _drainState so the status surface does
|
||||
// not stay stuck at Draining after the cancellation unwinds the method.
|
||||
// The row stays queued; the next drain tick will retry it.
|
||||
lock (_statusLock) { _drainState = HistorianDrainState.BackingOff; }
|
||||
throw;
|
||||
}
|
||||
catch (Exception ex)
|
||||
|
||||
+49
@@ -786,6 +786,55 @@ public sealed class SqliteStoreAndForwardSinkTests : IDisposable
|
||||
sink.GetStatus().QueueDepth.ShouldBe(0, "every event drained at the end of the run");
|
||||
}
|
||||
|
||||
/// <summary>
|
||||
/// Regression for Core.AlarmHistorian-012: when WriteBatchAsync throws
|
||||
/// <see cref="OperationCanceledException"/> the drain exits via re-throw
|
||||
/// without resetting <c>_drainState</c>, leaving the status surface permanently
|
||||
/// showing <c>Draining</c>. The next call to <see cref="SqliteStoreAndForwardSink.GetStatus"/>
|
||||
/// must reflect <c>Idle</c> or <c>BackingOff</c>, not the stale <c>Draining</c>
|
||||
/// state that was set at the top of the drain tick.
|
||||
/// </summary>
|
||||
[Fact]
|
||||
public async Task Drain_cancelled_mid_write_resets_drain_state_to_not_draining()
|
||||
{
|
||||
var writer = new CancellableWriter();
|
||||
using var sink = new SqliteStoreAndForwardSink(_dbPath, writer, _log);
|
||||
await sink.EnqueueAsync(Event("A1"), CancellationToken.None);
|
||||
|
||||
using var cts = new CancellationTokenSource();
|
||||
writer.CancelOn(cts);
|
||||
|
||||
// The drain will cancel inside WriteBatchAsync — OperationCanceledException re-throws.
|
||||
await Should.ThrowAsync<OperationCanceledException>(
|
||||
() => sink.DrainOnceAsync(cts.Token));
|
||||
|
||||
// DrainState must NOT be left as Draining after the cancellation escapes.
|
||||
sink.GetStatus().DrainState.ShouldNotBe(HistorianDrainState.Draining,
|
||||
"cancellation must clear the Draining status; row stays queued for next tick");
|
||||
|
||||
// The row must still be in the queue — the cancelled drain applied no outcomes.
|
||||
sink.GetStatus().QueueDepth.ShouldBe(1, "row stays queued when drain is cancelled");
|
||||
}
|
||||
|
||||
/// <summary>A writer that cancels its own CancellationTokenSource on the first call.</summary>
|
||||
private sealed class CancellableWriter : IAlarmHistorianWriter
|
||||
{
|
||||
private CancellationTokenSource? _cts;
|
||||
|
||||
/// <summary>Register the token source to cancel on first write.</summary>
|
||||
public void CancelOn(CancellationTokenSource cts) => _cts = cts;
|
||||
|
||||
/// <summary>Writes a batch; cancels via the registered CTS on the first call.</summary>
|
||||
public Task<IReadOnlyList<HistorianWriteOutcome>> WriteBatchAsync(
|
||||
IReadOnlyList<AlarmHistorianEvent> batch, CancellationToken ct)
|
||||
{
|
||||
_cts?.Cancel();
|
||||
ct.ThrowIfCancellationRequested();
|
||||
var outcomes = Enumerable.Repeat(HistorianWriteOutcome.Ack, batch.Count).ToList();
|
||||
return Task.FromResult<IReadOnlyList<HistorianWriteOutcome>>(outcomes);
|
||||
}
|
||||
}
|
||||
|
||||
/// <summary>
|
||||
/// Helper that confirms the queue depth surfaced by GetStatus matches a fresh
|
||||
/// COUNT(*) read directly from storage — proves the in-memory counter has not
|
||||
|
||||
Reference in New Issue
Block a user