Files

Joseph Doherty 99354bfaf2 fix(core-scripted-alarms): resolve Low code-review findings (Core.ScriptedAlarms-003,006,008,010,011; -009 documented)

- Core.ScriptedAlarms-003: emit OnEvent OUTSIDE _evalGate by collecting
  pending emissions during the gate-held section and flushing them after
  release; eliminates re-entrancy deadlock the docs already promised.
- Core.ScriptedAlarms-006: track every fire-and-forget Reevaluate /
  ShelvingCheck task in _inFlight; Dispose drains the set so the engine
  no longer races store writes against teardown.
- Core.ScriptedAlarms-008: store comments as ImmutableList<AlarmComment>
  so AppendComment is O(log n) instead of O(n).
- Core.ScriptedAlarms-010: document the deliberate input-quality
  asymmetry (Uncertain drives the predicate, renders {?} in the message)
  in docs/ScriptedAlarms.md and on MessageTemplate.Resolve remarks.
- Core.ScriptedAlarms-011: propagate the no-op reason through
  TransitionResult.NoOp(state, reason) and log it from
  ScriptedAlarmEngine.ApplyAsync.
- Core.ScriptedAlarms-009 (Won't Fix per recommendation): documented the
  per-evaluation dictionary allocation in docs/v2/Galaxy.Performance.md
  with a mitigation path if a future soak surfaces pressure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-23 07:23:31 -04:00

21 KiB

Raw Blame History

Code Review — Core.ScriptedAlarms

Field	Value
Module	`src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms`
Reviewer	Claude Code
Review date	2026-05-22
Commit reviewed	`76d35d1`
Status	Reviewed
Open findings	0

Checklist coverage

A comprehensive review completes every category, recording "No issues found" where a category produced nothing rather than leaving it blank.

#	Category	Result
1	Correctness & logic bugs	Core.ScriptedAlarms-002
2	OtOpcUa conventions	No issues found
3	Concurrency & thread safety	Core.ScriptedAlarms-001, Core.ScriptedAlarms-004, Core.ScriptedAlarms-005, Core.ScriptedAlarms-006
4	Error handling & resilience	Core.ScriptedAlarms-007
5	Security	No issues found
6	Performance & resource management	Core.ScriptedAlarms-008, Core.ScriptedAlarms-009
7	Design-document adherence	Core.ScriptedAlarms-010
8	Code organization & conventions	Core.ScriptedAlarms-011
9	Testing coverage	Core.ScriptedAlarms-012
10	Documentation & comments	Core.ScriptedAlarms-003

Findings

Core.ScriptedAlarms-001

Field	Value
Severity	High
Category	Concurrency & thread safety
Location	`ScriptedAlarmEngine.cs:175`, `ScriptedAlarmEngine.cs:178`, `ScriptedAlarmEngine.cs:73`, `ScriptedAlarmEngine.cs:368`
Status	Resolved

Description: _alarms is a plain Dictionary<string, AlarmState> (line 42). Every mutation of it (LoadAsync, ApplyAsync, ReevaluateAsync, ShelvingCheckAsync) correctly happens under the _evalGate semaphore, but four read paths touch it with no synchronisation: GetState (line 175), GetAllStates (line 178-179), the LoadedAlarmIds property (line 73), and RunShelvingCheck (line 368, _alarms.Keys.ToArray()). RunShelvingCheck fires from a Timer thread-pool callback and can run concurrently with an ApplyAsync/ReevaluateAsync that is reassigning a dictionary entry. Dictionary is not safe for concurrent read while another thread writes — even a value reassignment can be observed mid-rehash and throw InvalidOperationException or return torn state. GetState/GetAllStates are documented as being used by the Admin UI status page, so these reads come from arbitrary request threads.

Recommendation: Either switch _alarms to ConcurrentDictionary<string, AlarmState> (entry reassignment via _alarms[id] = ... is already the only write shape, which a ConcurrentDictionary supports atomically), or acquire _evalGate in every reader. A ConcurrentDictionary is the lighter change and matches _valueCache, which is already concurrent.

Resolution: Resolved 2026-05-22 — switched _alarms to ConcurrentDictionary<string, AlarmState> so the four unguarded read paths are safe against concurrent under-gate entry reassignment; added a concurrency regression test.

Core.ScriptedAlarms-002

Field	Value
Severity	Medium
Category	Correctness & logic bugs
Location	`ScriptedAlarmEngine.cs:162`, `ScriptedAlarmEngine.cs:90`
Status	Resolved

Description: LoadAsync is written to be re-callable — it begins by calling UnsubscribeFromUpstream(), _alarms.Clear(), and _alarmsReferencing.Clear() (lines 90-92), which only makes sense if a reload is supported. But at line 162 it unconditionally assigns _shelvingTimer = new Timer(...) without disposing the timer created by a previous LoadAsync call. A second LoadAsync therefore leaks the old Timer and leaves two timers running concurrently against the same _alarms/_evalGate. The old timer's RunShelvingCheck keeps firing forever.

Recommendation: Dispose any existing _shelvingTimer before reassigning it, e.g. _shelvingTimer?.Dispose(); immediately before line 162, inside the _evalGate critical section. If reload is genuinely not supported, instead guard LoadAsync against a second call and document it as one-shot.

Resolution: Resolved 2026-05-22 — added _shelvingTimer?.Dispose() before the timer reassignment in LoadAsync so a second load call does not leak the previous timer.

Core.ScriptedAlarms-003

Field	Value
Severity	Low
Category	Documentation & comments
Location	`ScriptedAlarmEngine.cs:343`, `docs/ScriptedAlarms.md:107`
Status	Resolved

Description: docs/ScriptedAlarms.md (Composition step 3) and the OnUpstreamChange comment ("Fire-and-forget so driver-side dispatch isn't blocked", line 225-226) describe the OnEvent emission path as non-blocking / fire-and-forget. In the code, EmitEvent invokes OnEvent?.Invoke(this, evt) synchronously while _evalGate is held (called from EvaluatePredicateToStateAsync line 305 and ApplyAsync line 217, both inside the gate). A slow subscriber blocks the single evaluation gate for all alarms; a subscriber that re-enters the engine (e.g. calls AcknowledgeAsync) deadlocks because _evalGate is a non-reentrant SemaphoreSlim(1,1). The behaviour is defensible (the historian sink is non-blocking, per the doc), but the comments/doc are misleading about where the work happens and the re-entrancy hazard is undocumented.

Recommendation: Either move EmitEvent outside the _evalGate critical section (collect emissions during the locked section and raise them after Release()), or document explicitly on OnEvent that handlers run under the engine lock, must be fast, and must never call back into the engine.

Resolution: Resolved 2026-05-23 — split EmitEvent into BuildEmission (called under the gate to capture a coherent value-cache snapshot for message-template resolution) and FireEvent (called after _evalGate.Release() so subscribers can re-enter the engine without deadlocking and a slow subscriber no longer blocks concurrent engine operations). Updated ApplyAsync, ReevaluateAsync, ShelvingCheckAsync, and LoadAsync (startup-recovery path) to collect emissions in a pending list and flush after the gate is released; added regression tests for both the re-entry path and a white-box gate-acquirable-from-subscriber check.

Core.ScriptedAlarms-004

Field	Value
Severity	Medium
Category	Concurrency & thread safety
Location	`ScriptedAlarmEngine.cs:138-143`, `ScriptedAlarmEngine.cs:227-234`
Status	Resolved

Description: During LoadAsync, _upstream.SubscribeTag(path, OnUpstreamChange) is called inside the _evalGate critical section (line 142). If an upstream implementation delivers an initial value synchronously from inside SubscribeTag (a common pattern, and the ITagUpstreamSource contract does not forbid it), the observer callback OnUpstreamChange runs on the calling thread, schedules ReevaluateAsync, which calls _evalGate.WaitAsync. That does not deadlock (the reevaluation task simply blocks until LoadAsync releases the gate), but it can cause a re-evaluation to run against a half-initialised _alarms/index, and the value written to _valueCache on line 141 may be immediately overwritten by the subscription's synchronous push with no defined ordering. The cold-start guard partly masks this, but the ordering between the seed read (line 141) and the subscription push is unspecified and may seed a stale value.

Recommendation: Subscribe to all upstream tags after the seed reads and after _loaded = true, or capture the subscription's first push into the cache and treat SubscribeTag as the single source of truth (drop the separate ReadTag seed). Document the expected ITagUpstreamSource delivery semantics (does SubscribeTag push an initial value?).

Resolution: Resolved 2026-05-22 — split the seed/subscribe loop: ReadTag seeds _valueCache, persisted-state restore runs, _loaded = true is set, then SubscribeTag is called; any synchronous initial push now arrives after _alarms is fully initialised and correctly queues behind the gate.

Core.ScriptedAlarms-005

Field	Value
Severity	Medium
Category	Concurrency & thread safety
Location	`ScriptedAlarmEngine.cs:365-369`, `ScriptedAlarmEngine.cs:416-424`
Status	Resolved

Description: Dispose sets _disposed = true, disposes _shelvingTimer, and clears _alarms. A RunShelvingCheck callback already in flight on a thread-pool thread can have passed its if (_disposed) return; check (line 367) before Dispose ran, then proceed into ShelvingCheckAsync, which awaits _evalGate and mutates _alarms — concurrently with Dispose's _alarms.Clear() at line 422 (which runs outside _evalGate). Timer.Dispose() does not wait for the running callback to finish. The result is a possible InvalidOperationException from a dictionary mutated during enumeration, or a save of stale state to the store after dispose. The same applies to a ReevaluateAsync in flight from a late upstream push.

Recommendation: Use Timer.Dispose(WaitHandle) (or DisposeAsync) to wait for the callback to drain, and perform _alarms.Clear() under _evalGate (or simply drop the clear — the object is being discarded). Also have ShelvingCheckAsync/ReevaluateAsync re-check _disposed after acquiring the gate before mutating/saving.

Resolution: Resolved 2026-05-22 — added _disposed re-checks in ReevaluateAsync and ShelvingCheckAsync after acquiring _evalGate so late callbacks bail out cleanly; dropped the unsynchronised _alarms.Clear() from Dispose since the object is being discarded and the clear raced concurrent reads.

Core.ScriptedAlarms-006

Field	Value
Severity	Low
Category	Concurrency & thread safety
Location	`ScriptedAlarmEngine.cs:232`, `ScriptedAlarmEngine.cs:369`
Status	Resolved

Description: OnUpstreamChange and RunShelvingCheck both launch fire-and-forget tasks (_ = ReevaluateAsync(...), _ = ShelvingCheckAsync(...)) with CancellationToken.None. There is no tracking of these in-flight tasks, so Dispose cannot await them and a server shutdown can race a still-running re-evaluation that writes to the (possibly disposed) store. Combined with finding 005, an upstream push arriving during shutdown produces an unobserved background task touching torn state.

Recommendation: Track outstanding background tasks (or use a single serialised worker / Channel), and link them to a CancellationTokenSource that Dispose cancels and drains. At minimum, await the in-flight work in Dispose.

Resolution: Resolved 2026-05-23 — added _inFlight HashSet + TrackBackgroundTask(...) helper to register every fire-and-forget ReevaluateAsync/ShelvingCheckAsync task, with a sync ContinueWith continuation that auto-removes on completion. Dispose snapshots the set under its own lock and Task.WhenAll(...).GetAwaiter().GetResult() drains them before returning; OnUpstreamChange also short-circuits when _disposed is set so no new work is queued during shutdown. Regression test exercises the slow-store path: Dispose blocks until the in-flight SaveAsync completes.

Core.ScriptedAlarms-007

Field	Value
Severity	Medium
Category	Error handling & resilience
Location	`ScriptedAlarmEngine.cs:216`, `ScriptedAlarmEngine.cs:251`, `ScriptedAlarmEngine.cs:154`, `ScriptedAlarmEngine.cs:387`
Status	Resolved

Description: Every state mutation calls await _store.SaveAsync(...) and relies on it succeeding. If the production SQL-backed IAlarmStateStore (Stream E) throws — transient SQL outage, deadlock, timeout — the exception propagates: in ApplyAsync it surfaces to the Part 9 method caller after the in-memory _alarms entry was already updated (line 215 runs before the save on line 216), leaving the in-memory state and the persisted state divergent; in ReevaluateAsync/ShelvingCheckAsync it is caught and logged, but again the in-memory _alarms entry was already advanced (lines 250/386) so the persisted store silently falls behind the live state. After a restart, startup recovery reloads the stale persisted state and operators can see a re-raised or re-ackable alarm. The docs claim "the store's view is always consistent with the in-memory state" (docs/ScriptedAlarms.md State persistence) — that invariant is not actually enforced.

Recommendation: Save before committing the in-memory update, or roll back the in-memory entry if SaveAsync fails, so the two never diverge. Classify transient store failures and retry, and surface a hard error/health-degraded signal if persistence is permanently failing rather than silently logging and continuing.

Resolution: Resolved 2026-05-22 — reordered SaveAsync/_alarms[id]= in ApplyAsync, ReevaluateAsync, and ShelvingCheckAsync so persistence happens before the in-memory update; a store failure now leaves both views at the prior state rather than diverging.

Core.ScriptedAlarms-008

Field	Value
Severity	Low
Category	Performance & resource management
Location	`Part9StateMachine.cs:261-268`
Status	Resolved

Description: AppendComment copies the entire existing comment list into a new List on every audit-producing transition (ack, confirm, shelve, unshelve, enable, disable, add-comment, auto-unshelve). The Comments list is append-only and unbounded — for a long-lived alarm that is acknowledged/commented hundreds of times, every transition is an O(n) copy and the full history is also re-serialised to the store on every SaveAsync. Over a multi-month uptime this is a slowly growing per-transition cost.

Recommendation: Acceptable for now given audit requirements, but consider an immutable persistent list / ImmutableList<AlarmComment> to make append O(log n), or have the store persist comments incrementally (append-only audit table) rather than rewriting the whole collection each save. At minimum, note the unbounded-growth characteristic in the design doc.

Resolution: Resolved 2026-05-23 — switched AlarmConditionState.Comments from IReadOnlyList<AlarmComment> to ImmutableList<AlarmComment> and rewrote AppendComment as existing.Add(...) so each append is O(log n) instead of the prior O(n) copy. ImmutableList<T> still implements IReadOnlyList<T> so existing consumers compile unchanged; the persistence layer continues to store comments as JSON so wire-format is unaffected. Regression test asserts the runtime type is ImmutableList<AlarmComment>.

Core.ScriptedAlarms-009

Field	Value
Severity	Low
Category	Performance & resource management
Location	`ScriptedAlarmEngine.cs:309-315`, `ScriptedAlarmEngine.cs:271`
Status	Won't Fix

Description: BuildReadCache allocates a fresh Dictionary<string, DataValueSnapshot> on every predicate evaluation, i.e. on every upstream tag change for every referencing alarm. On a busy line where many tags feeding many alarms change frequently, this is a steady stream of short-lived dictionary allocations on the hot path. AlarmPredicateContext is also newly constructed each evaluation (line 281).

Recommendation: Minor. If the evaluation path shows up in allocation profiling, the read cache could be a reused per-alarm buffer cleared between evaluations (evaluations are already serialised under _evalGate, so a single shared scratch dictionary is safe). Not worth doing speculatively — flag for the perf surface in docs/v2/Galaxy.Performance.md if alarm evaluation is ever soak-tested.

Resolution: Won't Fix 2026-05-23 — per the recommendation, no code change. Documented the known allocation characteristic in docs/v2/Galaxy.Performance.md (new "Scripted-alarm engine — known hot-path allocations" section) so a future soak that surfaces pressure has a noted mitigation (reused per-alarm scratch buffer) and we don't re-find this in a later review.

Core.ScriptedAlarms-010

Field	Value
Severity	Low
Category	Design-document adherence
Location	`ScriptedAlarmEngine.cs:325-336`, `AlarmPredicateContext.cs:33-40`, `MessageTemplate.cs:47`
Status	Resolved

Description: Quality handling is inconsistent across the three places that inspect a DataValueSnapshot.StatusCode. AreInputsReady (engine, line 333) treats only outright Bad (bit 31) as not-ready, so an Uncertain-quality input is fed to the predicate. MessageTemplate.Resolve (line 47) rejects any non-zero status code — including Uncertain — and renders {?}. AlarmPredicateContext.GetTag returns BadNodeIdUnknown (0x80340000) for a missing path. The net effect: an Uncertain-quality tag is considered good enough to drive an alarm activation decision but not good enough to print in the alarm message. docs/ScriptedAlarms.md ("Fallback rules") only documents the message-template behaviour and does not mention that predicate evaluation accepts Uncertain. The two policies should be reconciled and documented.

Recommendation: Decide one quality policy for "is this input usable" and apply it in both AreInputsReady and the message resolver, or explicitly document why predicate evaluation and message rendering treat Uncertain differently. Add the predicate-side Uncertain rule to docs/ScriptedAlarms.md.

Resolution: Resolved 2026-05-23 — documented the deliberate asymmetry. Added an "Input-quality policy" section to docs/ScriptedAlarms.md (table contrasting AreInputsReady's Bad-only rejection with MessageTemplate.Resolve's Good-only acceptance, plus the rationale) and a cross-referencing remarks block on MessageTemplate.Resolve. The two policies are kept distinct on purpose: predicate evaluation accepts Uncertain because the value is still inspectable, while the operator-facing message must render {?} to make the qualifier visible. Regression test locks in both behaviours with a single Uncertain-quality input that activates the alarm and surfaces {?} in the emission message.

Core.ScriptedAlarms-011

Field	Value
Severity	Low
Category	Code organization & conventions
Location	`Part9StateMachine.cs:275`
Status	Resolved

Description: TransitionResult.NoOp(state, reason) takes a reason string parameter that is documented in the calling code as a diagnostic ("disabled — predicate result ignored", "already acknowledged", etc.) but the factory method silently discards it — it just returns new(state, EmissionKind.None), identical to None(state). Every call site that passes a carefully-worded reason string is doing dead work, and the comments in Part9StateMachine and the class-level remarks claim disabled/no-op transitions "produce ... a diagnostic log line", which they do not.

Recommendation: Either propagate the reason (add it to TransitionResult and have the engine log it at debug level when emission is None for a no-op), or remove the unused reason parameter and collapse NoOp into None. Update the Part9StateMachine remarks that promise a diagnostic log line.

Resolution: Resolved 2026-05-23 — added a nullable NoOpReason property to TransitionResult (defaulted on the primary constructor so existing positional new TransitionResult(state, kind) call sites remain valid) and propagated it from TransitionResult.NoOp(state, reason). ScriptedAlarmEngine.ApplyAsync now logs the reason at debug level via the alarm's script logger when the transition is a no-op, fulfilling the class-level remarks. Two regression tests assert that NoOp carries the reason and None does not.

Core.ScriptedAlarms-012

Field	Value
Severity	Medium
Category	Testing coverage
Location	`tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ScriptedAlarmEngineTests.cs`
Status	Resolved

Description: Several engine behaviours central to the module have no test coverage: (1) the 5-second shelving timer / timed-shelve auto-expiry through the engine — only the pure Part9StateMachine.ApplyShelvingCheck is tested, never ScriptedAlarmEngine driving the timer with an injectable clock; (2) ConfirmAsync, TimedShelveAsync, UnshelveAsync, EnableAsync engine methods (only Acknowledge, OneShotShelve, Disable, AddComment are exercised); (3) OnEvent subscriber-throws isolation (EmitEvent catch on line 357); (4) IAlarmStateStore.SaveAsync failure handling (finding 007); (5) re-entrant LoadAsync and the timer leak (finding 002); (6) the cold-start AreInputsReady guard with Bad / null / Uncertain inputs. The clock and scriptTimeout constructor parameters exist specifically to make timer/timeout tests deterministic but no test uses them.

Recommendation: Add engine-level tests that inject a controllable Func<DateTime> clock to drive RunShelvingCheck, cover the remaining Part 9 engine methods end-to-end, assert subscriber-exception isolation, and add a store-failure fake to lock in the chosen persistence-failure semantics from finding 007.

Resolution: Resolved 2026-05-22 — added 8 new engine-level tests covering all 6 gap areas: injectable-clock timed-shelve expiry via RunShelvingCheckForTest, ConfirmAsync/TimedShelveAsync/UnshelveAsync/EnableAsync end-to-end, subscriber-exception isolation, store-failure invariant, second-LoadAsync timer-leak regression, and AreInputsReady Bad/Uncertain guard; exposed RunShelvingCheckForTest() internal hook on the engine.

21 KiB Raw Blame History