diff --git a/code-reviews/Core.AlarmHistorian/findings.md b/code-reviews/Core.AlarmHistorian/findings.md index cc731587..01bc78e2 100644 --- a/code-reviews/Core.AlarmHistorian/findings.md +++ b/code-reviews/Core.AlarmHistorian/findings.md @@ -4,10 +4,10 @@ |---|---| | Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian` | | Reviewer | Claude Code | -| Review date | 2026-05-22 | -| Commit reviewed | `76d35d1` | +| Review date | 2026-06-19 | +| Commit reviewed | `621d00e4` | | Status | Reviewed | -| Open findings | 0 | +| Open findings | 1 | ## Checklist coverage @@ -190,3 +190,69 @@ **Recommendation:** Update the doc-comments to refer to the Wonderware historian sidecar / `WonderwareHistorianClient` (`IAlarmHistorianWriter` implementation) instead of `Galaxy.Host`, consistent with `docs/AlarmTracking.md`'s "Historian write-back" section. **Resolution:** Resolved 2026-05-23 — the three stale `Galaxy.Host` references were already replaced ahead of this resolution by earlier commits (`bdca772` rewrote the `IAlarmHistorianSink` summary + `IAlarmHistorianWriter` summary to name the Wonderware historian sidecar / `WonderwareHistorianClient`; `f6d487b` rewrote the `AlarmHistorianEvent.EventKind` doc-comment). A fresh grep across the project confirms no remaining `Galaxy.Host` / "Stream G wires this" strings — only the legitimate `Galaxy-native` alarm-source label survives. Status flipped to Resolved during the -008 pass; no new source change was needed. + +--- + +## Re-review 2026-06-19 (commit 621d00e4) + +All 11 prior findings confirmed resolved at HEAD. The re-review covered all 10 checklist categories against the 3 source files (874 LOC) and the single test file (820 LOC, 28 tests) at HEAD. Two new findings: one resolved in-session (Medium), one deferred (Low). + +## Checklist coverage — re-review (621d00e4, 2026-06-19) + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | Core.AlarmHistorian-012 (Resolved) | +| 2 | OtOpcUa conventions | No issues found | +| 3 | Concurrency & thread safety | Core.AlarmHistorian-013 (Resolved) | +| 4 | Error handling & resilience | Core.AlarmHistorian-014 (Open-deferred) | +| 5 | Security | No issues found | +| 6 | Performance & resource management | No issues found | +| 7 | Design-document adherence | No issues found | +| 8 | Code organization & conventions | No issues found | +| 9 | Testing coverage | No issues found | +| 10 | Documentation & comments | No issues found | + +### Core.AlarmHistorian-012 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:412-415` | +| Status | Resolved | + +**Description:** When `WriteBatchAsync` throws `OperationCanceledException`, the catch at line 412–415 re-throws without resetting `_drainState`. The drain tick set `_drainState = Draining` at the top of `DrainOnceAsync` (inside `lock (_statusLock)`), and the `OperationCanceledException` path exits the method without setting it back to `Idle` or `BackingOff`. The status surface therefore permanently shows `Draining` until the next drain tick begins — any Admin UI poll or health check in that window sees a stale, misleading state. In production the drain timer always passes `CancellationToken.None` so this is not triggerable via the timer path; however, any code that calls `DrainOnceAsync` directly with a real cancellable token (future test code, operator tooling, or a shutdown path) will hit this. The `_drainState` invariant — only `Draining` while the semaphore is held — is violated. + +**Recommendation:** Before re-throwing, set `_drainState = BackingOff` under `_statusLock` so the status surface accurately reflects that the drain stopped without applying any outcomes. + +**Resolution:** Resolved 2026-06-19 — added `lock (_statusLock) { _drainState = HistorianDrainState.BackingOff; }` in the `OperationCanceledException` catch block before re-throwing. Regression test `Drain_cancelled_mid_write_resets_drain_state_to_not_draining` added; confirmed fail-before / pass-after. All 28 tests green. + +### Core.AlarmHistorian-013 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Concurrency & thread safety | +| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:79` | +| Status | Resolved | + +**Description:** `_disposed` is a plain `bool` field read on multiple threads (`EnqueueAsync` on the emitting thread, `DrainOnceAsync` / `RescheduleDrain` on the timer thread, `StartDrainLoop` on the caller's thread) and written by `Dispose()` on the owner's thread — with no memory barrier. Per the C# language specification this is a data race. On the .NET/x64 JIT the weak read is unlikely to manifest as a bug in practice, but it is technically undefined behaviour: the compiler or JIT is permitted to cache the stale `false` value in a register across repeated checks. `_backoffIndex` was already correctly marked `volatile` for the same cross-thread-visibility reason (commit `c20d2283`); `_disposed` was not updated at the same time. + +**Recommendation:** Mark `_disposed` as `volatile bool` so all reads see the most recently written value without requiring an additional lock or `Interlocked` operation. + +**Resolution:** Resolved 2026-06-19 — `_disposed` changed to `private volatile bool _disposed` with an explanatory comment. No test added (the race is non-deterministic and the existing `Disposed_sink_rejects_enqueue` test already exercises the dispose path; a data-race test would require a memory-model fuzzer). All 28 tests green. + +### Core.AlarmHistorian-014 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:515-555,558-568` | +| Status | Open | + +**Description:** `GetStatus()` and `RetryDeadLettered()` have no `_disposed` guard. After `Dispose()` is called both methods still open new `SqliteConnection` objects against the database file (which persists on disk after disposal). This works because SQLite is file-backed, but the behavior is surprising and inconsistent with the convention that post-disposal calls on an `IDisposable` object throw `ObjectDisposedException`. `RetryDeadLettered()` in particular performs a write (`UPDATE Queue …`) against a sink that the owner has declared finished. If a lingering Admin UI polling loop calls `GetStatus()` after the host shuts down and disposes the sink, no error is surfaced — the method silently succeeds. + +**Recommendation:** Add `if (_disposed) throw new ObjectDisposedException(nameof(SqliteStoreAndForwardSink));` guards at the top of `GetStatus()` and `RetryDeadLettered()`, consistent with the guards already present on `EnqueueAsync` and `StartDrainLoop`. + +**Deferred:** The current behavior (silently succeeds post-disposal) is benign in production because the Admin UI is also shutting down when the host disposes the sink. Adding the guard is a one-line change per method but requires coordinating with the Admin UI callers to confirm they tolerate `ObjectDisposedException` after shutdown — that is a cross-module concern. Deferring until the Admin UI shutdown sequence is reviewed. diff --git a/src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs b/src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs index f00ea542..5127bf71 100644 --- a/src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs +++ b/src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs @@ -76,7 +76,9 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable private Timer? _drainTimer; private TimeSpan _tickInterval; private volatile int _backoffIndex; - private bool _disposed; + // volatile: read by EnqueueAsync, DrainOnceAsync, RescheduleDrain, and StartDrainLoop + // from different threads; Dispose() writes it from the owner's thread (Core.AlarmHistorian-013). + private volatile bool _disposed; // Core.AlarmHistorian-005: status fields written by the drain timer thread and // read concurrently by GetStatus() / health-check threads. Guard all reads and @@ -411,6 +413,10 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable } catch (OperationCanceledException) { + // Core.AlarmHistorian-012: reset _drainState so the status surface does + // not stay stuck at Draining after the cancellation unwinds the method. + // The row stays queued; the next drain tick will retry it. + lock (_statusLock) { _drainState = HistorianDrainState.BackingOff; } throw; } catch (Exception ex) diff --git a/tests/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian.Tests/SqliteStoreAndForwardSinkTests.cs b/tests/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian.Tests/SqliteStoreAndForwardSinkTests.cs index d0348822..4096f275 100644 --- a/tests/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian.Tests/SqliteStoreAndForwardSinkTests.cs +++ b/tests/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian.Tests/SqliteStoreAndForwardSinkTests.cs @@ -786,6 +786,55 @@ public sealed class SqliteStoreAndForwardSinkTests : IDisposable sink.GetStatus().QueueDepth.ShouldBe(0, "every event drained at the end of the run"); } + /// + /// Regression for Core.AlarmHistorian-012: when WriteBatchAsync throws + /// the drain exits via re-throw + /// without resetting _drainState, leaving the status surface permanently + /// showing Draining. The next call to + /// must reflect Idle or BackingOff, not the stale Draining + /// state that was set at the top of the drain tick. + /// + [Fact] + public async Task Drain_cancelled_mid_write_resets_drain_state_to_not_draining() + { + var writer = new CancellableWriter(); + using var sink = new SqliteStoreAndForwardSink(_dbPath, writer, _log); + await sink.EnqueueAsync(Event("A1"), CancellationToken.None); + + using var cts = new CancellationTokenSource(); + writer.CancelOn(cts); + + // The drain will cancel inside WriteBatchAsync — OperationCanceledException re-throws. + await Should.ThrowAsync( + () => sink.DrainOnceAsync(cts.Token)); + + // DrainState must NOT be left as Draining after the cancellation escapes. + sink.GetStatus().DrainState.ShouldNotBe(HistorianDrainState.Draining, + "cancellation must clear the Draining status; row stays queued for next tick"); + + // The row must still be in the queue — the cancelled drain applied no outcomes. + sink.GetStatus().QueueDepth.ShouldBe(1, "row stays queued when drain is cancelled"); + } + + /// A writer that cancels its own CancellationTokenSource on the first call. + private sealed class CancellableWriter : IAlarmHistorianWriter + { + private CancellationTokenSource? _cts; + + /// Register the token source to cancel on first write. + public void CancelOn(CancellationTokenSource cts) => _cts = cts; + + /// Writes a batch; cancels via the registered CTS on the first call. + public Task> WriteBatchAsync( + IReadOnlyList batch, CancellationToken ct) + { + _cts?.Cancel(); + ct.ThrowIfCancellationRequested(); + var outcomes = Enumerable.Repeat(HistorianWriteOutcome.Ack, batch.Count).ToList(); + return Task.FromResult>(outcomes); + } + } + /// /// Helper that confirms the queue depth surfaced by GetStatus matches a fresh /// COUNT(*) read directly from storage — proves the in-memory counter has not