Files

Joseph Doherty 796871c210 fix(alarm-historian): keep queue rows aligned to events on drain (Core.AlarmHistorian-001)

ReadBatch built parallel rowIds / events lists: rowIds.Add ran for every
row but events.Add was guarded by `if (evt is not null)`. A corrupt /
null-deserializing payload desynced the lists, so DrainOnceAsync applied
each outcome to the wrong RowId — an Ack could delete an un-sent event
(silent alarm-event data loss) and the corrupt row stalled the queue
head forever.

ReadBatch now returns a single list of QueueRow(long RowId,
AlarmHistorianEvent? Event) records so a rowId can never drift from its
event; deserialization is wrapped to yield null on JsonException.
DrainOnceAsync immediately dead-letters rows whose payload is
null/un-deserializable and forwards only well-formed events to the
writer, mapping outcomes by RowId.

Regression tests cover a corrupt row mid-batch and at the queue head.
Core.AlarmHistorian suite: 16/16 pass.

Resolves code-review finding Core.AlarmHistorian-001 (Critical).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-22 05:54:20 -04:00

16 KiB

Raw Blame History

Code Review — Core.AlarmHistorian

Field	Value
Module	`src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian`
Reviewer	Claude Code
Review date	2026-05-22
Commit reviewed	`76d35d1`
Status	Reviewed
Open findings	10

Checklist coverage

#	Category	Result
1	Correctness & logic bugs	Core.AlarmHistorian-001, Core.AlarmHistorian-002
2	OtOpcUa conventions	Core.AlarmHistorian-003
3	Concurrency & thread safety	Core.AlarmHistorian-004, Core.AlarmHistorian-005
4	Error handling & resilience	Core.AlarmHistorian-006, Core.AlarmHistorian-007
5	Security	No issues found
6	Performance & resource management	Core.AlarmHistorian-008
7	Design-document adherence	Core.AlarmHistorian-009
8	Code organization & conventions	No issues found
9	Testing coverage	Core.AlarmHistorian-010
10	Documentation & comments	Core.AlarmHistorian-011

Findings

Core.AlarmHistorian-001

Field	Value
Severity	Critical
Category	Correctness & logic bugs
Location	`src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:255-278`
Status	Resolved

Description: ReadBatch builds two parallel lists, rowIds and events, that DrainOnceAsync later indexes together (rowIds[i] paired with outcomes[i], where outcomes is 1:1 with events). But rowIds.Add(reader.GetInt64(0)) runs unconditionally for every row, while events.Add(evt) is guarded by if (evt is not null). If JsonSerializer.Deserialize<AlarmHistorianEvent> returns null for any row (corrupt or empty payload), rowIds gains an entry but events does not. The writer then returns outcomes.Count == events.Count, which passes the outcomes.Count != events.Count guard, and the per-row loop applies each outcome to rowIds[i] — every row from the skipped index onward is mapped to the wrong event's outcome. An Ack can delete a row whose event was never sent to the historian (silent alarm-event data loss), and a PermanentFail can dead-letter an unrelated good row. The corrupt row itself is never advanced and is re-read on every drain forever, permanently stalling the queue head.

Recommendation: Keep rowIds and events strictly aligned. Either skip the rowId when deserialization returns null, or — better — treat a null/failed deserialization as an immediate dead-letter for that specific RowId (it can never succeed) and exclude it from the batch passed to the writer. Carry the rowId inside a single list of (long RowId, AlarmHistorianEvent Event) tuples so the two can never drift.

Resolution: Resolved 2026-05-22 — ReadBatch now returns a single list of QueueRow(long RowId, AlarmHistorianEvent? Event) records so a rowId can never drift from its event; DrainOnceAsync immediately dead-letters rows whose payload is null/un-deserializable (also catching JsonException) and forwards only well-formed events to the writer, mapping outcomes by liveRows[i].RowId. Regression tests Drain_with_corrupt_payload_row_deadletters_it_and_keeps_good_rows_aligned and Drain_with_corrupt_head_row_does_not_stall_queue added.

Core.AlarmHistorian-002

Field	Value
Severity	High
Category	Correctness & logic bugs
Location	`src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:99-105,386-388`
Status	Open

Description: The class computes an exponential-backoff value (_backoffIndex, BumpBackoff, CurrentBackoff, the BackoffLadder) and the class doc-comment states "Drain runs on a shared Timer. Exponential backoff on RetryPlease: 1s → 2s → 5s → 15s → 60s cap." However StartDrainLoop creates the Timer with a fixed tickInterval for both due-time and period and never reschedules it. CurrentBackoff is computed but never consulted by the timer, so the drain loop keeps hammering the historian at the fixed cadence regardless of BackingOff state. The documented backoff behavior does not exist for the production drain path — it is only observable via the CurrentBackoff property in tests.

Recommendation: Make the drain loop honor the backoff. Either switch to a self-rescheduling one-shot timer that sets its next due-time to max(tickInterval, CurrentBackoff) after each DrainOnceAsync, or have DrainOnceAsync skip the writer call while still inside the backoff window (track _nextEligibleDrainUtc). Update the doc-comment if the design intentionally changes.

Resolution: (open)

Core.AlarmHistorian-003

Field	Value
Severity	Medium
Category	OtOpcUa conventions
Location	`src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:107-127,218-243,246-253`
Status	Open

Description: EnqueueAsync is declared async-shaped (Task EnqueueAsync(...)) and the IAlarmHistorianSink contract explicitly states "the sink MUST NOT block the emitting thread … EnqueueAsync returns as soon as the queue row is committed." But the implementation does fully synchronous, blocking SQLite I/O (conn.Open(), EnforceCapacity, cmd.ExecuteNonQuery()) on the caller's thread and only then returns Task.CompletedTask. Under SQLite write contention with the drain worker this blocks the alarm-emitting thread for the full lock-wait. The same synchronous-work-behind-an-async-or-status-API pattern applies to GetStatus (called from the Admin UI / /healthz request thread) and RetryDeadLettered. The cancellationToken parameter of EnqueueAsync is accepted and ignored.

Recommendation: Either make the I/O genuinely asynchronous (await conn.OpenAsync(ct), await cmd.ExecuteNonQueryAsync(ct) — Microsoft.Data.Sqlite supports the async surface), or change EnqueueAsync to an in-memory hand-off (e.g. a Channel) drained by a background writer so the emitting thread truly never touches the database. At minimum honor the cancellationToken parameter.

Resolution: (open)

Core.AlarmHistorian-004

Field	Value
Severity	High
Category	Concurrency & thread safety
Location	`src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:90,112,176,259`
Status	Open

Description: Every operation opens a brand-new SqliteConnection from the bare connection string Data Source={databasePath} — no busy_timeout / Pragma, no shared cache. SQLite serializes writers with a file lock; when EnqueueAsync (emitting thread) and DrainOnceAsync (timer thread) collide, the loser gets an immediate SQLITE_BUSY exception because the default busy timeout is 0. In DrainOnceAsync the BeginTransaction() / Commit() block can fail mid-drain with SQLITE_BUSY; the exception escapes the try (it is not the writer-call try), the finally releases the gate, and the row outcomes are lost / partially applied. The class doc-comment claims DrainOnceAsync is "Safe to call from multiple threads" but the concurrent enqueue-vs-drain case is not actually safe against busy errors.

Recommendation: Configure a non-zero busy timeout — SqliteConnectionStringBuilder { DataSource = databasePath, DefaultTimeout = 5 } plus PRAGMA busy_timeout=5000 on open. Strongly consider WAL journal mode (PRAGMA journal_mode=WAL) so readers and the writer do not block each other. Reuse a single long-lived write connection guarded by _drainGate rather than opening/closing per call.

Resolution: (open)

Core.AlarmHistorian-005

Field	Value
Severity	Medium
Category	Concurrency & thread safety
Location	`src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:66-71,141-143,199,386-388`
Status	Open

Description: The mutable status fields _lastDrainUtc, _lastSuccessUtc, _lastError, _drainState, and _backoffIndex are written by the drain timer thread inside DrainOnceAsync and read concurrently by GetStatus() / CurrentBackoff on Admin-UI / health-check threads with no memory barrier (no lock, no volatile, no Interlocked). DateTime? is not guaranteed to be written atomically, and the reader can observe a stale or torn value. This is a diagnostics surface so the impact is limited, but a torn DateTime? read is real undefined behavior.

Recommendation: Guard the status fields with a small lock, or make the scalars volatile where the type permits and snapshot DateTime? values under a lock. Take the snapshot atomically in GetStatus().

Resolution: (open)

Core.AlarmHistorian-006

Field	Value
Severity	High
Category	Error handling & resilience
Location	`src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:103,135-216`
Status	Open

Description: StartDrainLoop launches the drain with new Timer(_ => _ = DrainOnceAsync(CancellationToken.None), ...). The returned Task is discarded (_ =), so any exception thrown by DrainOnceAsync is an unobserved task exception — never logged, never surfaced. Several paths in DrainOnceAsync can throw: the outcomes.Count != events.Count guard (InvalidOperationException), JsonSerializer.Deserialize on a malformed payload, PurgeAgedDeadLetters / ReadBatch / the commit block hitting SQLITE_BUSY or a schema error. When any of these throw, the drain silently stops making progress for that tick, _drainState is left stale (still Draining), and an operator watching the Admin UI sees no error. A persistently failing condition produces a silent, permanently stalled queue.

Recommendation: Wrap the timer callback body in a try/catch that logs the exception via _logger.Error, records it into _lastError, and resets _drainState so the diagnostics surface reflects the failure. Do not discard the Task without an attached continuation that observes faults.

Resolution: (open)

Core.AlarmHistorian-007

Field	Value
Severity	Medium
Category	Error handling & resilience
Location	`src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:172-174`
Status	Open

Description: When the writer returns a wrong-cardinality result, the code throws InvalidOperationException after WriteBatchAsync has already succeeded. The events were potentially delivered to the historian, but no rows are deleted or dead-lettered, _drainState is left at Draining, and the backoff is not bumped. Combined with Core.AlarmHistorian-006 the exception is then swallowed. On the next drain the same batch is re-sent — if the writer actually delivered the events the first time, this produces duplicate historian rows; if it is a deterministic writer bug the queue stalls forever.

Recommendation: Treat a cardinality mismatch as a transient batch failure: log it, set _lastError, bump backoff, set _drainState = BackingOff, and return without throwing — mirroring the writer-exception path at lines 162-170. A deterministic writer contract violation should also raise an operator-visible alert rather than silently looping.

Resolution: (open)

Core.AlarmHistorian-008

Field	Value
Severity	Low
Category	Performance & resource management
Location	`src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:107-127,255-278`
Status	Open

Description: Each EnqueueAsync (one per alarm transition — a hot path on a busy plant) opens a connection, runs EnforceCapacity (a COUNT(*) over the queue table on every single enqueue), serializes JSON, inserts, and closes the connection. The unconditional COUNT(*) on every enqueue is an avoidable scan; the open/close churn defeats connection pooling benefits and adds lock-acquisition overhead per event. DrainOnceAsync similarly opens three separate connections per tick (PurgeAgedDeadLetters, ReadBatch, the transaction block).

Recommendation: Reuse a single pooled write connection. Replace the per-enqueue COUNT(*) with a periodic capacity check (every Nth enqueue, or piggy-backed on the drain tick), or maintain an in-memory approximate counter. Combine the drain-tick connections into one.

Resolution: (open)

Core.AlarmHistorian-009

Field	Value
Severity	Medium
Category	Design-document adherence
Location	`src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:317-347`
Status	Open

Description: docs/AlarmTracking.md and the IAlarmHistorianSink contract present the SQLite queue as the durability guarantee — "Durably enqueue the event", "operator acks never block on the historian being reachable". But EnforceCapacity silently deletes the oldest non-dead-lettered (not-yet-sent) rows when the queue reaches DefaultCapacity (1,000,000). Those are alarm-event records that were accepted as durably queued and are then dropped before ever reaching the historian — silent alarm-history data loss under sustained historian outage. The only signal is a WARN log line. Neither docs/AlarmTracking.md nor the sink's XML doc mentions that the durability guarantee is bounded, and there is no metric/dead-letter trail for evicted rows.

Recommendation: At minimum document the bounded-durability behavior in docs/AlarmTracking.md and the IAlarmHistorianSink summary. Better: surface evicted-row counts in HistorianSinkStatus (a dedicated counter) so the loss is operator-visible, and consider routing overflow to the dead-letter table instead of hard-deleting it so the records survive for post-mortem within the retention window.

Resolution: (open)

Core.AlarmHistorian-010

Field	Value
Severity	Medium
Category	Testing coverage
Location	`tests/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian.Tests/SqliteStoreAndForwardSinkTests.cs`
Status	Open

Description: The test suite covers the happy paths well (Ack/Retry/PermanentFail, capacity eviction, retention purge, ctor validation) but leaves critical paths untested: (a) no test exercises a corrupt / null-deserializing PayloadJson row, so the rowIds/events misalignment bug (Core.AlarmHistorian-001) was not caught; (b) no test for StartDrainLoop actually running on the timer, nor for the backoff being honored by the schedule (Core.AlarmHistorian-002); (c) no concurrency test running EnqueueAsync and DrainOnceAsync in parallel, which is the exact scenario that triggers SQLITE_BUSY (Core.AlarmHistorian-004); (d) no test for the outcomes.Count != events.Count cardinality-mismatch branch (Core.AlarmHistorian-007).

Recommendation: Add tests for: a corrupt payload row (insert raw bad JSON via a direct SQLite write, then drain and assert the correct row is dead-lettered and others are unaffected); a FakeWriter returning a wrong-length outcome list; a parallel enqueue/drain stress test; and the timer-driven StartDrainLoop path.

Resolution: (open)

Core.AlarmHistorian-011

Field	Value
Severity	Low
Category	Documentation & comments
Location	`src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs:5-9,76`, `AlarmHistorianEvent.cs:20`
Status	Open

Description: Several doc-comments reference the retired v1 architecture. The IAlarmHistorianSink summary says ingestion "routes through Galaxy.Host's pipe" and IAlarmHistorianWriter says "Stream G wires this to the Galaxy.Host IPC client", but docs/AlarmTracking.md and CLAUDE.md state the legacy Galaxy.Host project was retired in PR 7.2 and the write path is now the Wonderware historian sidecar (WonderwareHistorianClient). AlarmHistorianEvent.cs:20 likewise says "the Galaxy.Host handler maps to the historian's enum on the wire." These stale references will mislead a reader about where the writer is actually hosted.

Recommendation: Update the doc-comments to refer to the Wonderware historian sidecar / WonderwareHistorianClient (IAlarmHistorianWriter implementation) instead of Galaxy.Host, consistent with docs/AlarmTracking.md's "Historian write-back" section.

Resolution: (open)

16 KiB Raw Blame History

Code Review — Core.AlarmHistorian

Checklist coverage

Findings

Core.AlarmHistorian-001

Core.AlarmHistorian-002

Core.AlarmHistorian-003

Core.AlarmHistorian-004

Core.AlarmHistorian-005

Core.AlarmHistorian-006

Core.AlarmHistorian-007

Core.AlarmHistorian-008

Core.AlarmHistorian-009

Core.AlarmHistorian-010

Core.AlarmHistorian-011

16 KiB

Raw Blame History