Files

T

Joseph Doherty 272a9da61e review(Core.ScriptedAlarms): stop shelving timer on failed reload + drop dead branch

Re-review at 7286d320. -015: dispose shelving timer at top of LoadAsync so a failed
reload doesn't leave it firing against partially-cleared state + test. -014: make
pendingEmissions required (removes unreachable fire-under-gate branch that could
reintroduce the -003 deadlock).

2026-06-19 11:21:35 -04:00

31 KiB

Raw Blame History

Code Review — Core.ScriptedAlarms

Field	Value
Module	`src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms`
Reviewer	Claude Code
Review date	2026-06-19
Commit reviewed	`7286d320`
Status	Reviewed
Open findings	0

Checklist coverage

A comprehensive review completes every category, recording "No issues found" where a category produced nothing rather than leaving it blank.

2026-05-22 review (commit `76d35d1`)

#	Category	Result
1	Correctness & logic bugs	Core.ScriptedAlarms-002
2	OtOpcUa conventions	No issues found
3	Concurrency & thread safety	Core.ScriptedAlarms-001, Core.ScriptedAlarms-004, Core.ScriptedAlarms-005, Core.ScriptedAlarms-006
4	Error handling & resilience	Core.ScriptedAlarms-007
5	Security	No issues found
6	Performance & resource management	Core.ScriptedAlarms-008, Core.ScriptedAlarms-009
7	Design-document adherence	Core.ScriptedAlarms-010
8	Code organization & conventions	Core.ScriptedAlarms-011
9	Testing coverage	Core.ScriptedAlarms-012
10	Documentation & comments	Core.ScriptedAlarms-003

2026-05-23 re-review (commit `a9be809`)

Focused re-review of the Core.ScriptedAlarms-009 resolution (commit 0001cdd) — new AlarmScratch class, _scratchByAlarmId ConcurrentDictionary, RefillReadCache helper, and internal test accessors. Only the changed/new code since 76d35d1 was re-examined; existing closed findings stay as audit trail.

#	Category	Result
1	Correctness & logic bugs	No issues found
2	OtOpcUa conventions	No issues found
3	Concurrency & thread safety	No issues found
4	Error handling & resilience	No issues found
5	Security	No issues found
6	Performance & resource management	No issues found
7	Design-document adherence	No issues found
8	Code organization & conventions	No issues found
9	Testing coverage	No issues found
10	Documentation & comments	Core.ScriptedAlarms-013

2026-06-19 re-review (commit `7286d320`)

Full re-review at HEAD. Covers all commits since a9be809: AlarmPredicateContext moved to Core.Scripting.Abstractions; ScriptedAlarmEvent gained Comment and HistorizeToAveva fields; AlarmAcknowledgeRequest.OperatorUser wired through ScriptedAlarmSource; CompiledScriptCache routing added; _alarms.Clear() in Dispose re-enabled under -016; mass XML-doc backfill; CVE patches. All prior findings remain Resolved. Cross-module observation noted below checklist (not recorded as a module finding — root cause is in Server).

#	Category	Result
1	Correctness & logic bugs	No issues found
2	OtOpcUa conventions	No issues found
3	Concurrency & thread safety	No issues found
4	Error handling & resilience	Core.ScriptedAlarms-015
5	Security	No issues found
6	Performance & resource management	No issues found
7	Design-document adherence	No issues found
8	Code organization & conventions	Core.ScriptedAlarms-014
9	Testing coverage	No issues found
10	Documentation & comments	No issues found

Cross-module observation (2026-06-19): OtOpcUaNodeManager.cs:677 constructs an AlarmCommand with User: string.Empty for the system-timer OnTimedUnshelve bypass path. This empty string flows to ScriptedAlarmHostActor.OnAlarmCommand → _engine.UnshelveAsync("", ...) → Part9StateMachine.ApplyUnshelve which throws ArgumentException for IsNullOrWhiteSpace(user). The exception is caught and logged at the actor level (not a crash), but the auto-unshelve silently fails. Root cause is in the Server module (OtOpcUaNodeManager.cs); the engine's user-validation guard is correct. Fix belongs in a Server module review.

Findings

Core.ScriptedAlarms-001

Field	Value
Severity	High
Category	Concurrency & thread safety
Location	`ScriptedAlarmEngine.cs:175`, `ScriptedAlarmEngine.cs:178`, `ScriptedAlarmEngine.cs:73`, `ScriptedAlarmEngine.cs:368`
Status	Resolved

Description: _alarms is a plain Dictionary<string, AlarmState> (line 42). Every mutation of it (LoadAsync, ApplyAsync, ReevaluateAsync, ShelvingCheckAsync) correctly happens under the _evalGate semaphore, but four read paths touch it with no synchronisation: GetState (line 175), GetAllStates (line 178-179), the LoadedAlarmIds property (line 73), and RunShelvingCheck (line 368, _alarms.Keys.ToArray()). RunShelvingCheck fires from a Timer thread-pool callback and can run concurrently with an ApplyAsync/ReevaluateAsync that is reassigning a dictionary entry. Dictionary is not safe for concurrent read while another thread writes — even a value reassignment can be observed mid-rehash and throw InvalidOperationException or return torn state. GetState/GetAllStates are documented as being used by the Admin UI status page, so these reads come from arbitrary request threads.

Recommendation: Either switch _alarms to ConcurrentDictionary<string, AlarmState> (entry reassignment via _alarms[id] = ... is already the only write shape, which a ConcurrentDictionary supports atomically), or acquire _evalGate in every reader. A ConcurrentDictionary is the lighter change and matches _valueCache, which is already concurrent.

Resolution: Resolved 2026-05-22 — switched _alarms to ConcurrentDictionary<string, AlarmState> so the four unguarded read paths are safe against concurrent under-gate entry reassignment; added a concurrency regression test.

Core.ScriptedAlarms-002

Field	Value
Severity	Medium
Category	Correctness & logic bugs
Location	`ScriptedAlarmEngine.cs:162`, `ScriptedAlarmEngine.cs:90`
Status	Resolved

Description: LoadAsync is written to be re-callable — it begins by calling UnsubscribeFromUpstream(), _alarms.Clear(), and _alarmsReferencing.Clear() (lines 90-92), which only makes sense if a reload is supported. But at line 162 it unconditionally assigns _shelvingTimer = new Timer(...) without disposing the timer created by a previous LoadAsync call. A second LoadAsync therefore leaks the old Timer and leaves two timers running concurrently against the same _alarms/_evalGate. The old timer's RunShelvingCheck keeps firing forever.

Recommendation: Dispose any existing _shelvingTimer before reassigning it, e.g. _shelvingTimer?.Dispose(); immediately before line 162, inside the _evalGate critical section. If reload is genuinely not supported, instead guard LoadAsync against a second call and document it as one-shot.

Resolution: Resolved 2026-05-22 — added _shelvingTimer?.Dispose() before the timer reassignment in LoadAsync so a second load call does not leak the previous timer.

Core.ScriptedAlarms-003

Field	Value
Severity	Low
Category	Documentation & comments
Location	`ScriptedAlarmEngine.cs:343`, `docs/ScriptedAlarms.md:107`
Status	Resolved

Description: docs/ScriptedAlarms.md (Composition step 3) and the OnUpstreamChange comment ("Fire-and-forget so driver-side dispatch isn't blocked", line 225-226) describe the OnEvent emission path as non-blocking / fire-and-forget. In the code, EmitEvent invokes OnEvent?.Invoke(this, evt) synchronously while _evalGate is held (called from EvaluatePredicateToStateAsync line 305 and ApplyAsync line 217, both inside the gate). A slow subscriber blocks the single evaluation gate for all alarms; a subscriber that re-enters the engine (e.g. calls AcknowledgeAsync) deadlocks because _evalGate is a non-reentrant SemaphoreSlim(1,1). The behaviour is defensible (the historian sink is non-blocking, per the doc), but the comments/doc are misleading about where the work happens and the re-entrancy hazard is undocumented.

Recommendation: Either move EmitEvent outside the _evalGate critical section (collect emissions during the locked section and raise them after Release()), or document explicitly on OnEvent that handlers run under the engine lock, must be fast, and must never call back into the engine.

Resolution: Resolved 2026-05-23 — split EmitEvent into BuildEmission (called under the gate to capture a coherent value-cache snapshot for message-template resolution) and FireEvent (called after _evalGate.Release() so subscribers can re-enter the engine without deadlocking and a slow subscriber no longer blocks concurrent engine operations). Updated ApplyAsync, ReevaluateAsync, ShelvingCheckAsync, and LoadAsync (startup-recovery path) to collect emissions in a pending list and flush after the gate is released; added regression tests for both the re-entry path and a white-box gate-acquirable-from-subscriber check.

Core.ScriptedAlarms-004

Field	Value
Severity	Medium
Category	Concurrency & thread safety
Location	`ScriptedAlarmEngine.cs:138-143`, `ScriptedAlarmEngine.cs:227-234`
Status	Resolved

Description: During LoadAsync, _upstream.SubscribeTag(path, OnUpstreamChange) is called inside the _evalGate critical section (line 142). If an upstream implementation delivers an initial value synchronously from inside SubscribeTag (a common pattern, and the ITagUpstreamSource contract does not forbid it), the observer callback OnUpstreamChange runs on the calling thread, schedules ReevaluateAsync, which calls _evalGate.WaitAsync. That does not deadlock (the reevaluation task simply blocks until LoadAsync releases the gate), but it can cause a re-evaluation to run against a half-initialised _alarms/index, and the value written to _valueCache on line 141 may be immediately overwritten by the subscription's synchronous push with no defined ordering. The cold-start guard partly masks this, but the ordering between the seed read (line 141) and the subscription push is unspecified and may seed a stale value.

Recommendation: Subscribe to all upstream tags after the seed reads and after _loaded = true, or capture the subscription's first push into the cache and treat SubscribeTag as the single source of truth (drop the separate ReadTag seed). Document the expected ITagUpstreamSource delivery semantics (does SubscribeTag push an initial value?).

Resolution: Resolved 2026-05-22 — split the seed/subscribe loop: ReadTag seeds _valueCache, persisted-state restore runs, _loaded = true is set, then SubscribeTag is called; any synchronous initial push now arrives after _alarms is fully initialised and correctly queues behind the gate.

Core.ScriptedAlarms-005

Field	Value
Severity	Medium
Category	Concurrency & thread safety
Location	`ScriptedAlarmEngine.cs:365-369`, `ScriptedAlarmEngine.cs:416-424`
Status	Resolved

Description: Dispose sets _disposed = true, disposes _shelvingTimer, and clears _alarms. A RunShelvingCheck callback already in flight on a thread-pool thread can have passed its if (_disposed) return; check (line 367) before Dispose ran, then proceed into ShelvingCheckAsync, which awaits _evalGate and mutates _alarms — concurrently with Dispose's _alarms.Clear() at line 422 (which runs outside _evalGate). Timer.Dispose() does not wait for the running callback to finish. The result is a possible InvalidOperationException from a dictionary mutated during enumeration, or a save of stale state to the store after dispose. The same applies to a ReevaluateAsync in flight from a late upstream push.

Recommendation: Use Timer.Dispose(WaitHandle) (or DisposeAsync) to wait for the callback to drain, and perform _alarms.Clear() under _evalGate (or simply drop the clear — the object is being discarded). Also have ShelvingCheckAsync/ReevaluateAsync re-check _disposed after acquiring the gate before mutating/saving.

Resolution: Resolved 2026-05-22 — added _disposed re-checks in ReevaluateAsync and ShelvingCheckAsync after acquiring _evalGate so late callbacks bail out cleanly; dropped the unsynchronised _alarms.Clear() from Dispose since the object is being discarded and the clear raced concurrent reads.

Core.ScriptedAlarms-006

Field	Value
Severity	Low
Category	Concurrency & thread safety
Location	`ScriptedAlarmEngine.cs:232`, `ScriptedAlarmEngine.cs:369`
Status	Resolved

Description: OnUpstreamChange and RunShelvingCheck both launch fire-and-forget tasks (_ = ReevaluateAsync(...), _ = ShelvingCheckAsync(...)) with CancellationToken.None. There is no tracking of these in-flight tasks, so Dispose cannot await them and a server shutdown can race a still-running re-evaluation that writes to the (possibly disposed) store. Combined with finding 005, an upstream push arriving during shutdown produces an unobserved background task touching torn state.

Recommendation: Track outstanding background tasks (or use a single serialised worker / Channel), and link them to a CancellationTokenSource that Dispose cancels and drains. At minimum, await the in-flight work in Dispose.

Resolution: Resolved 2026-05-23 — added _inFlight HashSet + TrackBackgroundTask(...) helper to register every fire-and-forget ReevaluateAsync/ShelvingCheckAsync task, with a sync ContinueWith continuation that auto-removes on completion. Dispose snapshots the set under its own lock and Task.WhenAll(...).GetAwaiter().GetResult() drains them before returning; OnUpstreamChange also short-circuits when _disposed is set so no new work is queued during shutdown. Regression test exercises the slow-store path: Dispose blocks until the in-flight SaveAsync completes.

Core.ScriptedAlarms-007

Field	Value
Severity	Medium
Category	Error handling & resilience
Location	`ScriptedAlarmEngine.cs:216`, `ScriptedAlarmEngine.cs:251`, `ScriptedAlarmEngine.cs:154`, `ScriptedAlarmEngine.cs:387`
Status	Resolved

Description: Every state mutation calls await _store.SaveAsync(...) and relies on it succeeding. If the production SQL-backed IAlarmStateStore (Stream E) throws — transient SQL outage, deadlock, timeout — the exception propagates: in ApplyAsync it surfaces to the Part 9 method caller after the in-memory _alarms entry was already updated (line 215 runs before the save on line 216), leaving the in-memory state and the persisted state divergent; in ReevaluateAsync/ShelvingCheckAsync it is caught and logged, but again the in-memory _alarms entry was already advanced (lines 250/386) so the persisted store silently falls behind the live state. After a restart, startup recovery reloads the stale persisted state and operators can see a re-raised or re-ackable alarm. The docs claim "the store's view is always consistent with the in-memory state" (docs/ScriptedAlarms.md State persistence) — that invariant is not actually enforced.

Recommendation: Save before committing the in-memory update, or roll back the in-memory entry if SaveAsync fails, so the two never diverge. Classify transient store failures and retry, and surface a hard error/health-degraded signal if persistence is permanently failing rather than silently logging and continuing.

Resolution: Resolved 2026-05-22 — reordered SaveAsync/_alarms[id]= in ApplyAsync, ReevaluateAsync, and ShelvingCheckAsync so persistence happens before the in-memory update; a store failure now leaves both views at the prior state rather than diverging.

Core.ScriptedAlarms-008

Field	Value
Severity	Low
Category	Performance & resource management
Location	`Part9StateMachine.cs:261-268`
Status	Resolved

Description: AppendComment copies the entire existing comment list into a new List on every audit-producing transition (ack, confirm, shelve, unshelve, enable, disable, add-comment, auto-unshelve). The Comments list is append-only and unbounded — for a long-lived alarm that is acknowledged/commented hundreds of times, every transition is an O(n) copy and the full history is also re-serialised to the store on every SaveAsync. Over a multi-month uptime this is a slowly growing per-transition cost.

Recommendation: Acceptable for now given audit requirements, but consider an immutable persistent list / ImmutableList<AlarmComment> to make append O(log n), or have the store persist comments incrementally (append-only audit table) rather than rewriting the whole collection each save. At minimum, note the unbounded-growth characteristic in the design doc.

Resolution: Resolved 2026-05-23 — switched AlarmConditionState.Comments from IReadOnlyList<AlarmComment> to ImmutableList<AlarmComment> and rewrote AppendComment as existing.Add(...) so each append is O(log n) instead of the prior O(n) copy. ImmutableList<T> still implements IReadOnlyList<T> so existing consumers compile unchanged; the persistence layer continues to store comments as JSON so wire-format is unaffected. Regression test asserts the runtime type is ImmutableList<AlarmComment>.

Core.ScriptedAlarms-009

Field	Value
Severity	Low
Category	Performance & resource management
Location	`ScriptedAlarmEngine.cs:309-315`, `ScriptedAlarmEngine.cs:271`
Status	Resolved

Description: BuildReadCache allocates a fresh Dictionary<string, DataValueSnapshot> on every predicate evaluation, i.e. on every upstream tag change for every referencing alarm. On a busy line where many tags feeding many alarms change frequently, this is a steady stream of short-lived dictionary allocations on the hot path. AlarmPredicateContext is also newly constructed each evaluation (line 281).

Recommendation: Minor. If the evaluation path shows up in allocation profiling, the read cache could be a reused per-alarm buffer cleared between evaluations (evaluations are already serialised under _evalGate, so a single shared scratch dictionary is safe). Not worth doing speculatively — flag for the perf surface in docs/v2/Galaxy.Performance.md if alarm evaluation is ever soak-tested.

Resolution: Resolved 2026-05-23 — added a per-alarm reusable AlarmScratch (read-cache Dictionary + AlarmPredicateContext) held in _scratchByAlarmId, populated lazily on first evaluation and refilled in place by the new RefillReadCache(Dictionary, IReadOnlySet) helper on every subsequent re-eval. BuildReadCache is gone. The reuse is safe because every evaluation runs under _evalGate; the context wraps the dictionary by reference, so the predicate's ctx.GetTag(path) sees the freshly-refilled values. LoadAsync clears _scratchByAlarmId alongside _alarms so a config-publish drops the prior generation's scratch (Inputs / Logger may change). Regression tests added in ScriptedAlarmEngineTests: Reevaluation_reuses_the_same_read_cache_dictionary, Reevaluation_reuses_the_same_predicate_context, and LoadAsync_drops_the_prior_generations_scratch; internal test hooks TryGetScratchReadCacheForTest / TryGetScratchContextForTest exposed via the existing InternalsVisibleTo. docs/v2/Galaxy.Performance.md "Scripted-alarm engine" section rewritten to document the new reuse contract. Suite now 66 green (was 63).

Core.ScriptedAlarms-010

Field	Value
Severity	Low
Category	Design-document adherence
Location	`ScriptedAlarmEngine.cs:325-336`, `AlarmPredicateContext.cs:33-40`, `MessageTemplate.cs:47`
Status	Resolved

Description: Quality handling is inconsistent across the three places that inspect a DataValueSnapshot.StatusCode. AreInputsReady (engine, line 333) treats only outright Bad (bit 31) as not-ready, so an Uncertain-quality input is fed to the predicate. MessageTemplate.Resolve (line 47) rejects any non-zero status code — including Uncertain — and renders {?}. AlarmPredicateContext.GetTag returns BadNodeIdUnknown (0x80340000) for a missing path. The net effect: an Uncertain-quality tag is considered good enough to drive an alarm activation decision but not good enough to print in the alarm message. docs/ScriptedAlarms.md ("Fallback rules") only documents the message-template behaviour and does not mention that predicate evaluation accepts Uncertain. The two policies should be reconciled and documented.

Recommendation: Decide one quality policy for "is this input usable" and apply it in both AreInputsReady and the message resolver, or explicitly document why predicate evaluation and message rendering treat Uncertain differently. Add the predicate-side Uncertain rule to docs/ScriptedAlarms.md.

Resolution: Resolved 2026-05-23 — documented the deliberate asymmetry. Added an "Input-quality policy" section to docs/ScriptedAlarms.md (table contrasting AreInputsReady's Bad-only rejection with MessageTemplate.Resolve's Good-only acceptance, plus the rationale) and a cross-referencing remarks block on MessageTemplate.Resolve. The two policies are kept distinct on purpose: predicate evaluation accepts Uncertain because the value is still inspectable, while the operator-facing message must render {?} to make the qualifier visible. Regression test locks in both behaviours with a single Uncertain-quality input that activates the alarm and surfaces {?} in the emission message.

Core.ScriptedAlarms-011

Field	Value
Severity	Low
Category	Code organization & conventions
Location	`Part9StateMachine.cs:275`
Status	Resolved

Description: TransitionResult.NoOp(state, reason) takes a reason string parameter that is documented in the calling code as a diagnostic ("disabled — predicate result ignored", "already acknowledged", etc.) but the factory method silently discards it — it just returns new(state, EmissionKind.None), identical to None(state). Every call site that passes a carefully-worded reason string is doing dead work, and the comments in Part9StateMachine and the class-level remarks claim disabled/no-op transitions "produce ... a diagnostic log line", which they do not.

Recommendation: Either propagate the reason (add it to TransitionResult and have the engine log it at debug level when emission is None for a no-op), or remove the unused reason parameter and collapse NoOp into None. Update the Part9StateMachine remarks that promise a diagnostic log line.

Resolution: Resolved 2026-05-23 — added a nullable NoOpReason property to TransitionResult (defaulted on the primary constructor so existing positional new TransitionResult(state, kind) call sites remain valid) and propagated it from TransitionResult.NoOp(state, reason). ScriptedAlarmEngine.ApplyAsync now logs the reason at debug level via the alarm's script logger when the transition is a no-op, fulfilling the class-level remarks. Two regression tests assert that NoOp carries the reason and None does not.

Core.ScriptedAlarms-012

Field	Value
Severity	Medium
Category	Testing coverage
Location	`tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ScriptedAlarmEngineTests.cs`
Status	Resolved

Description: Several engine behaviours central to the module have no test coverage: (1) the 5-second shelving timer / timed-shelve auto-expiry through the engine — only the pure Part9StateMachine.ApplyShelvingCheck is tested, never ScriptedAlarmEngine driving the timer with an injectable clock; (2) ConfirmAsync, TimedShelveAsync, UnshelveAsync, EnableAsync engine methods (only Acknowledge, OneShotShelve, Disable, AddComment are exercised); (3) OnEvent subscriber-throws isolation (EmitEvent catch on line 357); (4) IAlarmStateStore.SaveAsync failure handling (finding 007); (5) re-entrant LoadAsync and the timer leak (finding 002); (6) the cold-start AreInputsReady guard with Bad / null / Uncertain inputs. The clock and scriptTimeout constructor parameters exist specifically to make timer/timeout tests deterministic but no test uses them.

Recommendation: Add engine-level tests that inject a controllable Func<DateTime> clock to drive RunShelvingCheck, cover the remaining Part 9 engine methods end-to-end, assert subscriber-exception isolation, and add a store-failure fake to lock in the chosen persistence-failure semantics from finding 007.

Resolution: Resolved 2026-05-22 — added 8 new engine-level tests covering all 6 gap areas: injectable-clock timed-shelve expiry via RunShelvingCheckForTest, ConfirmAsync/TimedShelveAsync/UnshelveAsync/EnableAsync end-to-end, subscriber-exception isolation, store-failure invariant, second-LoadAsync timer-leak regression, and AreInputsReady Bad/Uncertain guard; exposed RunShelvingCheckForTest() internal hook on the engine.

Core.ScriptedAlarms-013

Field	Value
Severity	Low
Category	Documentation & comments
Location	`ScriptedAlarmEngine.cs:66-81`
Status	Resolved

Description: The new internal test accessors TryGetScratchReadCacheForTest and TryGetScratchContextForTest (introduced by the Core.ScriptedAlarms-009 resolution at 0001cdd) return the live per-alarm scratch — the same Dictionary<string, DataValueSnapshot> instance the engine clears and refills in RefillReadCache under _evalGate, plus the AlarmPredicateContext that wraps it by reference. The XML docs describe the intended use ("assert the scratch is reused across evaluations (two reads return the same instance)") but do not explicitly warn that:

The returned IReadOnlyDictionary is the engine's mutable working set. Enumerating it from a test thread while the engine is mid-evaluation (e.g. during a ReevaluateAsync queued by OnUpstreamChange, or a ShelvingCheckAsync callback) is a concurrent-read-while-writer scenario against a plain Dictionary — undefined behaviour, can throw InvalidOperationException or return torn data.
Reference-equality comparisons (ReferenceEquals(a, b)) and single-key indexer reads (dict["Temp"]) on a quiesced engine are the only safe uses. The existing regression tests stay within those bounds, but a future test author has no in-code signal that broader reads are unsafe.

The engine itself is correct — RefillReadCache runs only under _evalGate, so the engine never tears its own state. The risk is purely on the test-side contract.

Recommendation: Add a <remarks> block to both TryGetScratchReadCacheForTest and TryGetScratchContextForTest stating that the returned references point at live engine state, that reads are only safe when the engine is known to be idle (no in-flight ReevaluateAsync/ShelvingCheckAsync/LoadAsync), and that the intended uses are reference-identity assertions plus single-key lookups against a quiesced engine — never enumeration. No code change required; the engine's correctness depends on _evalGate, which is already documented.

Resolution: Resolved 2026-05-23 — applied the recommendation verbatim. Added a <remarks> block to each of TryGetScratchReadCacheForTest and TryGetScratchContextForTest documenting the synchronization contract: the returned references point at live engine state refilled in place under _evalGate, enumeration during an in-flight evaluation is a concurrent-read-while-writer scenario against a plain Dictionary (undefined behaviour), and the only safe uses are reference-identity comparisons + single-key reads against a quiesced engine. No code change required — the engine's correctness was always there; only the test-side contract was undocumented.

Core.ScriptedAlarms-014

Field	Value
Severity	Low
Category	Code organization & conventions
Location	`ScriptedAlarmEngine.cs:539-540` (pre-fix)
Status	Resolved

Description: EvaluatePredicateToStateAsync had a nullable optional parameter List<ScriptedAlarmEvent>? pendingEmissions = null with a fallback branch else FireEvent(evt); // LoadAsync path: no caller-supplied list, fire here. The fallback was dead code — both callers (LoadAsync at line 267 and ReevaluateAsync at line 453) always pass an explicit pending list, so the pendingEmissions is null branch was never reachable. The comment was actively misleading: it implied a valid caller path that did not exist. Had a future developer called the method with null (or with a new call site that omitted the argument), the event would have been fired under _evalGate — directly reintroducing the deadlock hazard fixed by Core.ScriptedAlarms-003.

Recommendation: Remove the nullable default and dead branch; make the parameter required (List<ScriptedAlarmEvent> pendingEmissions) so the compiler enforces the contract that all callers must supply a pending list.

Resolution: Resolved 2026-06-19 — changed pendingEmissions from List<ScriptedAlarmEvent>? pendingEmissions = null to required List<ScriptedAlarmEvent> pendingEmissions; removed the dead else FireEvent(evt) branch and its misleading comment; added a <remarks> doc explaining why the parameter is required (preventing a future caller from accidentally firing events under the gate). Both existing callers already passed a non-null list so no call sites changed. Build and all 71 tests pass.

Core.ScriptedAlarms-015

Field	Value
Severity	Low
Category	Error handling & resilience
Location	`ScriptedAlarmEngine.cs:284-298` (pre-fix) / `ScriptedAlarmEngine.cs:184-198` (post-fix)
Status	Resolved

Description: LoadAsync disposed _shelvingTimer (the prior generation's 5-second timer) AFTER the compileFailures guard on line 241. If a reload call threw at the compile step (all alarms fail to compile), execution never reached the _shelvingTimer?.Dispose() call. The prior-generation timer kept firing against whatever _alarms remained after the partial clear + partial recompile — a window of inconsistency between the failed reload and the eventual Dispose() call. Dispose() itself does call _shelvingTimer?.Dispose() on line 734, so there is no permanent resource leak; the risk is unexpected shelving-check callbacks operating on partially-cleared state during that window. On a first-load failure (where _loaded remains false) the issue is benign since EnsureLoaded() gates all public API calls. On a reload failure (previously loaded) the timer runs against whatever alarms happened to compile before the bad ones.

Recommendation: Move _shelvingTimer?.Dispose(); _shelvingTimer = null; to the very start of the try block, alongside UnsubscribeFromUpstream(), so the prior timer is always stopped before _alarms is cleared — regardless of whether the reload succeeds or fails.

Resolution: Resolved 2026-06-19 — moved _shelvingTimer?.Dispose(); _shelvingTimer = null; to the start of the try block in LoadAsync, before UnsubscribeFromUpstream() and _alarms.Clear(). Removed the now-redundant _shelvingTimer?.Dispose() that previously appeared just before the new-timer assignment. Updated the comment at the new-timer assignment site to note that _shelvingTimer is guaranteed null here. Added regression test Failed_reload_leaves_engine_recoverable_and_disposes_cleanly documenting that after a failed reload + successful third load, Dispose() completes cleanly with no background tasks outliving the engine.

31 KiB Raw Blame History