Files

Joseph Doherty 0001cdd579 fix(scripted-alarms): reuse per-alarm evaluation scratch on the hot path

Core.ScriptedAlarms-009 resolution: replace the per-call Dictionary +
AlarmPredicateContext allocation with a per-alarm reusable AlarmScratch
held in _scratchByAlarmId, refilled in place under _evalGate on each
evaluation. The hot path no longer allocates per upstream tag change.

Why this matters:
  On a busy line where many tags feeding many alarms change frequently,
  the old BuildReadCache allocated a fresh dictionary + context on every
  predicate evaluation — a steady stream of short-lived allocations the
  GC eventually has to reclaim. With the reuse, the dictionary and
  context are allocated once per alarm (on first evaluation) and refilled
  in place across every subsequent re-eval.

Implementation:
  - New private AlarmScratch class holds the reusable
    Dictionary<string, DataValueSnapshot> read cache (pre-sized to the
    alarm's Inputs.Count) and the AlarmPredicateContext that wraps it by
    reference. The context observes refilled values without being
    re-created.
  - ConcurrentDictionary<string, AlarmScratch> _scratchByAlarmId on the
    engine, cleared in LoadAsync alongside _alarms so a config-publish
    drops the prior generation's scratch (Inputs / Logger may change).
  - EvaluatePredicateToStateAsync looks up scratch via GetOrAdd, calls
    the new RefillReadCache(Dictionary, IReadOnlySet) helper to clear +
    repopulate the dictionary in place, then runs the predicate against
    the reused context.
  - BuildReadCache removed.

Safety:
  Reuse is serialised under _evalGate which guarantees no two threads
  ever observe the same scratch in a half-refilled state. The
  AlarmPredicateContext is bound to the scratch dictionary by reference,
  so the predicate's ctx.GetTag(path) sees the freshly-refilled values
  rather than a stale snapshot.

Verification:
  - All 66 ScriptedAlarms tests pass (was 63 — three new regression tests
    locking the reuse contract).
  - All 56 VirtualTags tests still pass (unchanged).
  - All 104 Core.Scripting tests still pass (unchanged).

New tests in ScriptedAlarmEngineTests:
  - Reevaluation_reuses_the_same_read_cache_dictionary — asserts
    ReferenceEquals(scratch_before, scratch_after) across two
    evaluations of the same alarm.
  - Reevaluation_reuses_the_same_predicate_context — same, for the
    context.
  - LoadAsync_drops_the_prior_generations_scratch — asserts a config
    publish wipes the prior scratch (so a stale Logger / Inputs can't
    leak into the new generation).

Internal test hooks TryGetScratchReadCacheForTest /
TryGetScratchContextForTest added via the existing
InternalsVisibleTo for the tests project. Kept internal — not part of
the public engine surface.

Docs:
  - docs/v2/Galaxy.Performance.md "Scripted-alarm engine" section
    rewritten as "hot-path allocation reuse" documenting the new
    contract + reuse safety reasoning + the three regression tests.
  - code-reviews/Core.ScriptedAlarms/findings.md -009 flipped
    Won't Fix → Resolved.
  - code-reviews/README.md regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-23 16:10:09 -04:00

22 KiB

Raw Blame History

Code Review — Core.ScriptedAlarms

Field	Value
Module	`src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms`
Reviewer	Claude Code
Review date	2026-05-22
Commit reviewed	`76d35d1`
Status	Reviewed
Open findings	0

Checklist coverage

A comprehensive review completes every category, recording "No issues found" where a category produced nothing rather than leaving it blank.

#	Category	Result
1	Correctness & logic bugs	Core.ScriptedAlarms-002
2	OtOpcUa conventions	No issues found
3	Concurrency & thread safety	Core.ScriptedAlarms-001, Core.ScriptedAlarms-004, Core.ScriptedAlarms-005, Core.ScriptedAlarms-006
4	Error handling & resilience	Core.ScriptedAlarms-007
5	Security	No issues found
6	Performance & resource management	Core.ScriptedAlarms-008, Core.ScriptedAlarms-009
7	Design-document adherence	Core.ScriptedAlarms-010
8	Code organization & conventions	Core.ScriptedAlarms-011
9	Testing coverage	Core.ScriptedAlarms-012
10	Documentation & comments	Core.ScriptedAlarms-003

Findings

Core.ScriptedAlarms-001

Field	Value
Severity	High
Category	Concurrency & thread safety
Location	`ScriptedAlarmEngine.cs:175`, `ScriptedAlarmEngine.cs:178`, `ScriptedAlarmEngine.cs:73`, `ScriptedAlarmEngine.cs:368`
Status	Resolved

Description: _alarms is a plain Dictionary<string, AlarmState> (line 42). Every mutation of it (LoadAsync, ApplyAsync, ReevaluateAsync, ShelvingCheckAsync) correctly happens under the _evalGate semaphore, but four read paths touch it with no synchronisation: GetState (line 175), GetAllStates (line 178-179), the LoadedAlarmIds property (line 73), and RunShelvingCheck (line 368, _alarms.Keys.ToArray()). RunShelvingCheck fires from a Timer thread-pool callback and can run concurrently with an ApplyAsync/ReevaluateAsync that is reassigning a dictionary entry. Dictionary is not safe for concurrent read while another thread writes — even a value reassignment can be observed mid-rehash and throw InvalidOperationException or return torn state. GetState/GetAllStates are documented as being used by the Admin UI status page, so these reads come from arbitrary request threads.

Recommendation: Either switch _alarms to ConcurrentDictionary<string, AlarmState> (entry reassignment via _alarms[id] = ... is already the only write shape, which a ConcurrentDictionary supports atomically), or acquire _evalGate in every reader. A ConcurrentDictionary is the lighter change and matches _valueCache, which is already concurrent.

Resolution: Resolved 2026-05-22 — switched _alarms to ConcurrentDictionary<string, AlarmState> so the four unguarded read paths are safe against concurrent under-gate entry reassignment; added a concurrency regression test.

Core.ScriptedAlarms-002

Field	Value
Severity	Medium
Category	Correctness & logic bugs
Location	`ScriptedAlarmEngine.cs:162`, `ScriptedAlarmEngine.cs:90`
Status	Resolved

Description: LoadAsync is written to be re-callable — it begins by calling UnsubscribeFromUpstream(), _alarms.Clear(), and _alarmsReferencing.Clear() (lines 90-92), which only makes sense if a reload is supported. But at line 162 it unconditionally assigns _shelvingTimer = new Timer(...) without disposing the timer created by a previous LoadAsync call. A second LoadAsync therefore leaks the old Timer and leaves two timers running concurrently against the same _alarms/_evalGate. The old timer's RunShelvingCheck keeps firing forever.

Recommendation: Dispose any existing _shelvingTimer before reassigning it, e.g. _shelvingTimer?.Dispose(); immediately before line 162, inside the _evalGate critical section. If reload is genuinely not supported, instead guard LoadAsync against a second call and document it as one-shot.

Resolution: Resolved 2026-05-22 — added _shelvingTimer?.Dispose() before the timer reassignment in LoadAsync so a second load call does not leak the previous timer.

Core.ScriptedAlarms-003

Field	Value
Severity	Low
Category	Documentation & comments
Location	`ScriptedAlarmEngine.cs:343`, `docs/ScriptedAlarms.md:107`
Status	Resolved

Description: docs/ScriptedAlarms.md (Composition step 3) and the OnUpstreamChange comment ("Fire-and-forget so driver-side dispatch isn't blocked", line 225-226) describe the OnEvent emission path as non-blocking / fire-and-forget. In the code, EmitEvent invokes OnEvent?.Invoke(this, evt) synchronously while _evalGate is held (called from EvaluatePredicateToStateAsync line 305 and ApplyAsync line 217, both inside the gate). A slow subscriber blocks the single evaluation gate for all alarms; a subscriber that re-enters the engine (e.g. calls AcknowledgeAsync) deadlocks because _evalGate is a non-reentrant SemaphoreSlim(1,1). The behaviour is defensible (the historian sink is non-blocking, per the doc), but the comments/doc are misleading about where the work happens and the re-entrancy hazard is undocumented.

Recommendation: Either move EmitEvent outside the _evalGate critical section (collect emissions during the locked section and raise them after Release()), or document explicitly on OnEvent that handlers run under the engine lock, must be fast, and must never call back into the engine.

Resolution: Resolved 2026-05-23 — split EmitEvent into BuildEmission (called under the gate to capture a coherent value-cache snapshot for message-template resolution) and FireEvent (called after _evalGate.Release() so subscribers can re-enter the engine without deadlocking and a slow subscriber no longer blocks concurrent engine operations). Updated ApplyAsync, ReevaluateAsync, ShelvingCheckAsync, and LoadAsync (startup-recovery path) to collect emissions in a pending list and flush after the gate is released; added regression tests for both the re-entry path and a white-box gate-acquirable-from-subscriber check.

Core.ScriptedAlarms-004

Field	Value
Severity	Medium
Category	Concurrency & thread safety
Location	`ScriptedAlarmEngine.cs:138-143`, `ScriptedAlarmEngine.cs:227-234`
Status	Resolved

Description: During LoadAsync, _upstream.SubscribeTag(path, OnUpstreamChange) is called inside the _evalGate critical section (line 142). If an upstream implementation delivers an initial value synchronously from inside SubscribeTag (a common pattern, and the ITagUpstreamSource contract does not forbid it), the observer callback OnUpstreamChange runs on the calling thread, schedules ReevaluateAsync, which calls _evalGate.WaitAsync. That does not deadlock (the reevaluation task simply blocks until LoadAsync releases the gate), but it can cause a re-evaluation to run against a half-initialised _alarms/index, and the value written to _valueCache on line 141 may be immediately overwritten by the subscription's synchronous push with no defined ordering. The cold-start guard partly masks this, but the ordering between the seed read (line 141) and the subscription push is unspecified and may seed a stale value.

Recommendation: Subscribe to all upstream tags after the seed reads and after _loaded = true, or capture the subscription's first push into the cache and treat SubscribeTag as the single source of truth (drop the separate ReadTag seed). Document the expected ITagUpstreamSource delivery semantics (does SubscribeTag push an initial value?).

Resolution: Resolved 2026-05-22 — split the seed/subscribe loop: ReadTag seeds _valueCache, persisted-state restore runs, _loaded = true is set, then SubscribeTag is called; any synchronous initial push now arrives after _alarms is fully initialised and correctly queues behind the gate.

Core.ScriptedAlarms-005

Field	Value
Severity	Medium
Category	Concurrency & thread safety
Location	`ScriptedAlarmEngine.cs:365-369`, `ScriptedAlarmEngine.cs:416-424`
Status	Resolved

Description: Dispose sets _disposed = true, disposes _shelvingTimer, and clears _alarms. A RunShelvingCheck callback already in flight on a thread-pool thread can have passed its if (_disposed) return; check (line 367) before Dispose ran, then proceed into ShelvingCheckAsync, which awaits _evalGate and mutates _alarms — concurrently with Dispose's _alarms.Clear() at line 422 (which runs outside _evalGate). Timer.Dispose() does not wait for the running callback to finish. The result is a possible InvalidOperationException from a dictionary mutated during enumeration, or a save of stale state to the store after dispose. The same applies to a ReevaluateAsync in flight from a late upstream push.

Recommendation: Use Timer.Dispose(WaitHandle) (or DisposeAsync) to wait for the callback to drain, and perform _alarms.Clear() under _evalGate (or simply drop the clear — the object is being discarded). Also have ShelvingCheckAsync/ReevaluateAsync re-check _disposed after acquiring the gate before mutating/saving.

Resolution: Resolved 2026-05-22 — added _disposed re-checks in ReevaluateAsync and ShelvingCheckAsync after acquiring _evalGate so late callbacks bail out cleanly; dropped the unsynchronised _alarms.Clear() from Dispose since the object is being discarded and the clear raced concurrent reads.

Core.ScriptedAlarms-006

Field	Value
Severity	Low
Category	Concurrency & thread safety
Location	`ScriptedAlarmEngine.cs:232`, `ScriptedAlarmEngine.cs:369`
Status	Resolved

Description: OnUpstreamChange and RunShelvingCheck both launch fire-and-forget tasks (_ = ReevaluateAsync(...), _ = ShelvingCheckAsync(...)) with CancellationToken.None. There is no tracking of these in-flight tasks, so Dispose cannot await them and a server shutdown can race a still-running re-evaluation that writes to the (possibly disposed) store. Combined with finding 005, an upstream push arriving during shutdown produces an unobserved background task touching torn state.

Recommendation: Track outstanding background tasks (or use a single serialised worker / Channel), and link them to a CancellationTokenSource that Dispose cancels and drains. At minimum, await the in-flight work in Dispose.

Resolution: Resolved 2026-05-23 — added _inFlight HashSet + TrackBackgroundTask(...) helper to register every fire-and-forget ReevaluateAsync/ShelvingCheckAsync task, with a sync ContinueWith continuation that auto-removes on completion. Dispose snapshots the set under its own lock and Task.WhenAll(...).GetAwaiter().GetResult() drains them before returning; OnUpstreamChange also short-circuits when _disposed is set so no new work is queued during shutdown. Regression test exercises the slow-store path: Dispose blocks until the in-flight SaveAsync completes.

Core.ScriptedAlarms-007

Field	Value
Severity	Medium
Category	Error handling & resilience
Location	`ScriptedAlarmEngine.cs:216`, `ScriptedAlarmEngine.cs:251`, `ScriptedAlarmEngine.cs:154`, `ScriptedAlarmEngine.cs:387`
Status	Resolved

Description: Every state mutation calls await _store.SaveAsync(...) and relies on it succeeding. If the production SQL-backed IAlarmStateStore (Stream E) throws — transient SQL outage, deadlock, timeout — the exception propagates: in ApplyAsync it surfaces to the Part 9 method caller after the in-memory _alarms entry was already updated (line 215 runs before the save on line 216), leaving the in-memory state and the persisted state divergent; in ReevaluateAsync/ShelvingCheckAsync it is caught and logged, but again the in-memory _alarms entry was already advanced (lines 250/386) so the persisted store silently falls behind the live state. After a restart, startup recovery reloads the stale persisted state and operators can see a re-raised or re-ackable alarm. The docs claim "the store's view is always consistent with the in-memory state" (docs/ScriptedAlarms.md State persistence) — that invariant is not actually enforced.

Recommendation: Save before committing the in-memory update, or roll back the in-memory entry if SaveAsync fails, so the two never diverge. Classify transient store failures and retry, and surface a hard error/health-degraded signal if persistence is permanently failing rather than silently logging and continuing.

Resolution: Resolved 2026-05-22 — reordered SaveAsync/_alarms[id]= in ApplyAsync, ReevaluateAsync, and ShelvingCheckAsync so persistence happens before the in-memory update; a store failure now leaves both views at the prior state rather than diverging.

Core.ScriptedAlarms-008

Field	Value
Severity	Low
Category	Performance & resource management
Location	`Part9StateMachine.cs:261-268`
Status	Resolved

Description: AppendComment copies the entire existing comment list into a new List on every audit-producing transition (ack, confirm, shelve, unshelve, enable, disable, add-comment, auto-unshelve). The Comments list is append-only and unbounded — for a long-lived alarm that is acknowledged/commented hundreds of times, every transition is an O(n) copy and the full history is also re-serialised to the store on every SaveAsync. Over a multi-month uptime this is a slowly growing per-transition cost.

Recommendation: Acceptable for now given audit requirements, but consider an immutable persistent list / ImmutableList<AlarmComment> to make append O(log n), or have the store persist comments incrementally (append-only audit table) rather than rewriting the whole collection each save. At minimum, note the unbounded-growth characteristic in the design doc.

Resolution: Resolved 2026-05-23 — switched AlarmConditionState.Comments from IReadOnlyList<AlarmComment> to ImmutableList<AlarmComment> and rewrote AppendComment as existing.Add(...) so each append is O(log n) instead of the prior O(n) copy. ImmutableList<T> still implements IReadOnlyList<T> so existing consumers compile unchanged; the persistence layer continues to store comments as JSON so wire-format is unaffected. Regression test asserts the runtime type is ImmutableList<AlarmComment>.

Core.ScriptedAlarms-009

Field	Value
Severity	Low
Category	Performance & resource management
Location	`ScriptedAlarmEngine.cs:309-315`, `ScriptedAlarmEngine.cs:271`
Status	Resolved

Description: BuildReadCache allocates a fresh Dictionary<string, DataValueSnapshot> on every predicate evaluation, i.e. on every upstream tag change for every referencing alarm. On a busy line where many tags feeding many alarms change frequently, this is a steady stream of short-lived dictionary allocations on the hot path. AlarmPredicateContext is also newly constructed each evaluation (line 281).

Recommendation: Minor. If the evaluation path shows up in allocation profiling, the read cache could be a reused per-alarm buffer cleared between evaluations (evaluations are already serialised under _evalGate, so a single shared scratch dictionary is safe). Not worth doing speculatively — flag for the perf surface in docs/v2/Galaxy.Performance.md if alarm evaluation is ever soak-tested.

Resolution: Resolved 2026-05-23 — added a per-alarm reusable AlarmScratch (read-cache Dictionary + AlarmPredicateContext) held in _scratchByAlarmId, populated lazily on first evaluation and refilled in place by the new RefillReadCache(Dictionary, IReadOnlySet) helper on every subsequent re-eval. BuildReadCache is gone. The reuse is safe because every evaluation runs under _evalGate; the context wraps the dictionary by reference, so the predicate's ctx.GetTag(path) sees the freshly-refilled values. LoadAsync clears _scratchByAlarmId alongside _alarms so a config-publish drops the prior generation's scratch (Inputs / Logger may change). Regression tests added in ScriptedAlarmEngineTests: Reevaluation_reuses_the_same_read_cache_dictionary, Reevaluation_reuses_the_same_predicate_context, and LoadAsync_drops_the_prior_generations_scratch; internal test hooks TryGetScratchReadCacheForTest / TryGetScratchContextForTest exposed via the existing InternalsVisibleTo. docs/v2/Galaxy.Performance.md "Scripted-alarm engine" section rewritten to document the new reuse contract. Suite now 66 green (was 63).

Core.ScriptedAlarms-010

Field	Value
Severity	Low
Category	Design-document adherence
Location	`ScriptedAlarmEngine.cs:325-336`, `AlarmPredicateContext.cs:33-40`, `MessageTemplate.cs:47`
Status	Resolved

Description: Quality handling is inconsistent across the three places that inspect a DataValueSnapshot.StatusCode. AreInputsReady (engine, line 333) treats only outright Bad (bit 31) as not-ready, so an Uncertain-quality input is fed to the predicate. MessageTemplate.Resolve (line 47) rejects any non-zero status code — including Uncertain — and renders {?}. AlarmPredicateContext.GetTag returns BadNodeIdUnknown (0x80340000) for a missing path. The net effect: an Uncertain-quality tag is considered good enough to drive an alarm activation decision but not good enough to print in the alarm message. docs/ScriptedAlarms.md ("Fallback rules") only documents the message-template behaviour and does not mention that predicate evaluation accepts Uncertain. The two policies should be reconciled and documented.

Recommendation: Decide one quality policy for "is this input usable" and apply it in both AreInputsReady and the message resolver, or explicitly document why predicate evaluation and message rendering treat Uncertain differently. Add the predicate-side Uncertain rule to docs/ScriptedAlarms.md.

Resolution: Resolved 2026-05-23 — documented the deliberate asymmetry. Added an "Input-quality policy" section to docs/ScriptedAlarms.md (table contrasting AreInputsReady's Bad-only rejection with MessageTemplate.Resolve's Good-only acceptance, plus the rationale) and a cross-referencing remarks block on MessageTemplate.Resolve. The two policies are kept distinct on purpose: predicate evaluation accepts Uncertain because the value is still inspectable, while the operator-facing message must render {?} to make the qualifier visible. Regression test locks in both behaviours with a single Uncertain-quality input that activates the alarm and surfaces {?} in the emission message.

Core.ScriptedAlarms-011

Field	Value
Severity	Low
Category	Code organization & conventions
Location	`Part9StateMachine.cs:275`
Status	Resolved

Description: TransitionResult.NoOp(state, reason) takes a reason string parameter that is documented in the calling code as a diagnostic ("disabled — predicate result ignored", "already acknowledged", etc.) but the factory method silently discards it — it just returns new(state, EmissionKind.None), identical to None(state). Every call site that passes a carefully-worded reason string is doing dead work, and the comments in Part9StateMachine and the class-level remarks claim disabled/no-op transitions "produce ... a diagnostic log line", which they do not.

Recommendation: Either propagate the reason (add it to TransitionResult and have the engine log it at debug level when emission is None for a no-op), or remove the unused reason parameter and collapse NoOp into None. Update the Part9StateMachine remarks that promise a diagnostic log line.

Resolution: Resolved 2026-05-23 — added a nullable NoOpReason property to TransitionResult (defaulted on the primary constructor so existing positional new TransitionResult(state, kind) call sites remain valid) and propagated it from TransitionResult.NoOp(state, reason). ScriptedAlarmEngine.ApplyAsync now logs the reason at debug level via the alarm's script logger when the transition is a no-op, fulfilling the class-level remarks. Two regression tests assert that NoOp carries the reason and None does not.

Core.ScriptedAlarms-012

Field	Value
Severity	Medium
Category	Testing coverage
Location	`tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ScriptedAlarmEngineTests.cs`
Status	Resolved

Description: Several engine behaviours central to the module have no test coverage: (1) the 5-second shelving timer / timed-shelve auto-expiry through the engine — only the pure Part9StateMachine.ApplyShelvingCheck is tested, never ScriptedAlarmEngine driving the timer with an injectable clock; (2) ConfirmAsync, TimedShelveAsync, UnshelveAsync, EnableAsync engine methods (only Acknowledge, OneShotShelve, Disable, AddComment are exercised); (3) OnEvent subscriber-throws isolation (EmitEvent catch on line 357); (4) IAlarmStateStore.SaveAsync failure handling (finding 007); (5) re-entrant LoadAsync and the timer leak (finding 002); (6) the cold-start AreInputsReady guard with Bad / null / Uncertain inputs. The clock and scriptTimeout constructor parameters exist specifically to make timer/timeout tests deterministic but no test uses them.

Recommendation: Add engine-level tests that inject a controllable Func<DateTime> clock to drive RunShelvingCheck, cover the remaining Part 9 engine methods end-to-end, assert subscriber-exception isolation, and add a store-failure fake to lock in the chosen persistence-failure semantics from finding 007.

Resolution: Resolved 2026-05-22 — added 8 new engine-level tests covering all 6 gap areas: injectable-clock timed-shelve expiry via RunShelvingCheckForTest, ConfirmAsync/TimedShelveAsync/UnshelveAsync/EnableAsync end-to-end, subscriber-exception isolation, store-failure invariant, second-LoadAsync timer-leak regression, and AreInputsReady Bad/Uncertain guard; exposed RunShelvingCheckForTest() internal hook on the engine.

22 KiB Raw Blame History