Mark Core.ScriptedAlarms-002, -004, -005, -007, -012 as Resolved with one-line descriptions. Update open-findings count from 11 to 6. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
18 KiB
Code Review — Core.ScriptedAlarms
| Field | Value |
|---|---|
| Module | src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | 76d35d1 |
| Status | Reviewed |
| Open findings | 6 |
Checklist coverage
A comprehensive review completes every category, recording "No issues found" where a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core.ScriptedAlarms-002 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Core.ScriptedAlarms-001, Core.ScriptedAlarms-004, Core.ScriptedAlarms-005, Core.ScriptedAlarms-006 |
| 4 | Error handling & resilience | Core.ScriptedAlarms-007 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Core.ScriptedAlarms-008, Core.ScriptedAlarms-009 |
| 7 | Design-document adherence | Core.ScriptedAlarms-010 |
| 8 | Code organization & conventions | Core.ScriptedAlarms-011 |
| 9 | Testing coverage | Core.ScriptedAlarms-012 |
| 10 | Documentation & comments | Core.ScriptedAlarms-003 |
Findings
Core.ScriptedAlarms-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | ScriptedAlarmEngine.cs:175, ScriptedAlarmEngine.cs:178, ScriptedAlarmEngine.cs:73, ScriptedAlarmEngine.cs:368 |
| Status | Resolved |
Description: _alarms is a plain Dictionary<string, AlarmState> (line 42). Every mutation of it (LoadAsync, ApplyAsync, ReevaluateAsync, ShelvingCheckAsync) correctly happens under the _evalGate semaphore, but four read paths touch it with no synchronisation: GetState (line 175), GetAllStates (line 178-179), the LoadedAlarmIds property (line 73), and RunShelvingCheck (line 368, _alarms.Keys.ToArray()). RunShelvingCheck fires from a Timer thread-pool callback and can run concurrently with an ApplyAsync/ReevaluateAsync that is reassigning a dictionary entry. Dictionary is not safe for concurrent read while another thread writes — even a value reassignment can be observed mid-rehash and throw InvalidOperationException or return torn state. GetState/GetAllStates are documented as being used by the Admin UI status page, so these reads come from arbitrary request threads.
Recommendation: Either switch _alarms to ConcurrentDictionary<string, AlarmState> (entry reassignment via _alarms[id] = ... is already the only write shape, which a ConcurrentDictionary supports atomically), or acquire _evalGate in every reader. A ConcurrentDictionary is the lighter change and matches _valueCache, which is already concurrent.
Resolution: Resolved 2026-05-22 — switched _alarms to ConcurrentDictionary<string, AlarmState> so the four unguarded read paths are safe against concurrent under-gate entry reassignment; added a concurrency regression test.
Core.ScriptedAlarms-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | ScriptedAlarmEngine.cs:162, ScriptedAlarmEngine.cs:90 |
| Status | Resolved |
Description: LoadAsync is written to be re-callable — it begins by calling UnsubscribeFromUpstream(), _alarms.Clear(), and _alarmsReferencing.Clear() (lines 90-92), which only makes sense if a reload is supported. But at line 162 it unconditionally assigns _shelvingTimer = new Timer(...) without disposing the timer created by a previous LoadAsync call. A second LoadAsync therefore leaks the old Timer and leaves two timers running concurrently against the same _alarms/_evalGate. The old timer's RunShelvingCheck keeps firing forever.
Recommendation: Dispose any existing _shelvingTimer before reassigning it, e.g. _shelvingTimer?.Dispose(); immediately before line 162, inside the _evalGate critical section. If reload is genuinely not supported, instead guard LoadAsync against a second call and document it as one-shot.
Resolution: Resolved 2026-05-22 — added _shelvingTimer?.Dispose() before the timer reassignment in LoadAsync so a second load call does not leak the previous timer.
Core.ScriptedAlarms-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | ScriptedAlarmEngine.cs:343, docs/ScriptedAlarms.md:107 |
| Status | Open |
Description: docs/ScriptedAlarms.md (Composition step 3) and the OnUpstreamChange comment ("Fire-and-forget so driver-side dispatch isn't blocked", line 225-226) describe the OnEvent emission path as non-blocking / fire-and-forget. In the code, EmitEvent invokes OnEvent?.Invoke(this, evt) synchronously while _evalGate is held (called from EvaluatePredicateToStateAsync line 305 and ApplyAsync line 217, both inside the gate). A slow subscriber blocks the single evaluation gate for all alarms; a subscriber that re-enters the engine (e.g. calls AcknowledgeAsync) deadlocks because _evalGate is a non-reentrant SemaphoreSlim(1,1). The behaviour is defensible (the historian sink is non-blocking, per the doc), but the comments/doc are misleading about where the work happens and the re-entrancy hazard is undocumented.
Recommendation: Either move EmitEvent outside the _evalGate critical section (collect emissions during the locked section and raise them after Release()), or document explicitly on OnEvent that handlers run under the engine lock, must be fast, and must never call back into the engine.
Resolution: (open)
Core.ScriptedAlarms-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | ScriptedAlarmEngine.cs:138-143, ScriptedAlarmEngine.cs:227-234 |
| Status | Resolved |
Description: During LoadAsync, _upstream.SubscribeTag(path, OnUpstreamChange) is called inside the _evalGate critical section (line 142). If an upstream implementation delivers an initial value synchronously from inside SubscribeTag (a common pattern, and the ITagUpstreamSource contract does not forbid it), the observer callback OnUpstreamChange runs on the calling thread, schedules ReevaluateAsync, which calls _evalGate.WaitAsync. That does not deadlock (the reevaluation task simply blocks until LoadAsync releases the gate), but it can cause a re-evaluation to run against a half-initialised _alarms/index, and the value written to _valueCache on line 141 may be immediately overwritten by the subscription's synchronous push with no defined ordering. The cold-start guard partly masks this, but the ordering between the seed read (line 141) and the subscription push is unspecified and may seed a stale value.
Recommendation: Subscribe to all upstream tags after the seed reads and after _loaded = true, or capture the subscription's first push into the cache and treat SubscribeTag as the single source of truth (drop the separate ReadTag seed). Document the expected ITagUpstreamSource delivery semantics (does SubscribeTag push an initial value?).
Resolution: Resolved 2026-05-22 — split the seed/subscribe loop: ReadTag seeds _valueCache, persisted-state restore runs, _loaded = true is set, then SubscribeTag is called; any synchronous initial push now arrives after _alarms is fully initialised and correctly queues behind the gate.
Core.ScriptedAlarms-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | ScriptedAlarmEngine.cs:365-369, ScriptedAlarmEngine.cs:416-424 |
| Status | Resolved |
Description: Dispose sets _disposed = true, disposes _shelvingTimer, and clears _alarms. A RunShelvingCheck callback already in flight on a thread-pool thread can have passed its if (_disposed) return; check (line 367) before Dispose ran, then proceed into ShelvingCheckAsync, which awaits _evalGate and mutates _alarms — concurrently with Dispose's _alarms.Clear() at line 422 (which runs outside _evalGate). Timer.Dispose() does not wait for the running callback to finish. The result is a possible InvalidOperationException from a dictionary mutated during enumeration, or a save of stale state to the store after dispose. The same applies to a ReevaluateAsync in flight from a late upstream push.
Recommendation: Use Timer.Dispose(WaitHandle) (or DisposeAsync) to wait for the callback to drain, and perform _alarms.Clear() under _evalGate (or simply drop the clear — the object is being discarded). Also have ShelvingCheckAsync/ReevaluateAsync re-check _disposed after acquiring the gate before mutating/saving.
Resolution: Resolved 2026-05-22 — added _disposed re-checks in ReevaluateAsync and ShelvingCheckAsync after acquiring _evalGate so late callbacks bail out cleanly; dropped the unsynchronised _alarms.Clear() from Dispose since the object is being discarded and the clear raced concurrent reads.
Core.ScriptedAlarms-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | ScriptedAlarmEngine.cs:232, ScriptedAlarmEngine.cs:369 |
| Status | Open |
Description: OnUpstreamChange and RunShelvingCheck both launch fire-and-forget tasks (_ = ReevaluateAsync(...), _ = ShelvingCheckAsync(...)) with CancellationToken.None. There is no tracking of these in-flight tasks, so Dispose cannot await them and a server shutdown can race a still-running re-evaluation that writes to the (possibly disposed) store. Combined with finding 005, an upstream push arriving during shutdown produces an unobserved background task touching torn state.
Recommendation: Track outstanding background tasks (or use a single serialised worker / Channel), and link them to a CancellationTokenSource that Dispose cancels and drains. At minimum, await the in-flight work in Dispose.
Resolution: (open)
Core.ScriptedAlarms-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | ScriptedAlarmEngine.cs:216, ScriptedAlarmEngine.cs:251, ScriptedAlarmEngine.cs:154, ScriptedAlarmEngine.cs:387 |
| Status | Resolved |
Description: Every state mutation calls await _store.SaveAsync(...) and relies on it succeeding. If the production SQL-backed IAlarmStateStore (Stream E) throws — transient SQL outage, deadlock, timeout — the exception propagates: in ApplyAsync it surfaces to the Part 9 method caller after the in-memory _alarms entry was already updated (line 215 runs before the save on line 216), leaving the in-memory state and the persisted state divergent; in ReevaluateAsync/ShelvingCheckAsync it is caught and logged, but again the in-memory _alarms entry was already advanced (lines 250/386) so the persisted store silently falls behind the live state. After a restart, startup recovery reloads the stale persisted state and operators can see a re-raised or re-ackable alarm. The docs claim "the store's view is always consistent with the in-memory state" (docs/ScriptedAlarms.md State persistence) — that invariant is not actually enforced.
Recommendation: Save before committing the in-memory update, or roll back the in-memory entry if SaveAsync fails, so the two never diverge. Classify transient store failures and retry, and surface a hard error/health-degraded signal if persistence is permanently failing rather than silently logging and continuing.
Resolution: Resolved 2026-05-22 — reordered SaveAsync/_alarms[id]= in ApplyAsync, ReevaluateAsync, and ShelvingCheckAsync so persistence happens before the in-memory update; a store failure now leaves both views at the prior state rather than diverging.
Core.ScriptedAlarms-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | Part9StateMachine.cs:261-268 |
| Status | Open |
Description: AppendComment copies the entire existing comment list into a new List on every audit-producing transition (ack, confirm, shelve, unshelve, enable, disable, add-comment, auto-unshelve). The Comments list is append-only and unbounded — for a long-lived alarm that is acknowledged/commented hundreds of times, every transition is an O(n) copy and the full history is also re-serialised to the store on every SaveAsync. Over a multi-month uptime this is a slowly growing per-transition cost.
Recommendation: Acceptable for now given audit requirements, but consider an immutable persistent list / ImmutableList<AlarmComment> to make append O(log n), or have the store persist comments incrementally (append-only audit table) rather than rewriting the whole collection each save. At minimum, note the unbounded-growth characteristic in the design doc.
Resolution: (open)
Core.ScriptedAlarms-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | ScriptedAlarmEngine.cs:309-315, ScriptedAlarmEngine.cs:271 |
| Status | Open |
Description: BuildReadCache allocates a fresh Dictionary<string, DataValueSnapshot> on every predicate evaluation, i.e. on every upstream tag change for every referencing alarm. On a busy line where many tags feeding many alarms change frequently, this is a steady stream of short-lived dictionary allocations on the hot path. AlarmPredicateContext is also newly constructed each evaluation (line 281).
Recommendation: Minor. If the evaluation path shows up in allocation profiling, the read cache could be a reused per-alarm buffer cleared between evaluations (evaluations are already serialised under _evalGate, so a single shared scratch dictionary is safe). Not worth doing speculatively — flag for the perf surface in docs/v2/Galaxy.Performance.md if alarm evaluation is ever soak-tested.
Resolution: (open)
Core.ScriptedAlarms-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | ScriptedAlarmEngine.cs:325-336, AlarmPredicateContext.cs:33-40, MessageTemplate.cs:47 |
| Status | Open |
Description: Quality handling is inconsistent across the three places that inspect a DataValueSnapshot.StatusCode. AreInputsReady (engine, line 333) treats only outright Bad (bit 31) as not-ready, so an Uncertain-quality input is fed to the predicate. MessageTemplate.Resolve (line 47) rejects any non-zero status code — including Uncertain — and renders {?}. AlarmPredicateContext.GetTag returns BadNodeIdUnknown (0x80340000) for a missing path. The net effect: an Uncertain-quality tag is considered good enough to drive an alarm activation decision but not good enough to print in the alarm message. docs/ScriptedAlarms.md ("Fallback rules") only documents the message-template behaviour and does not mention that predicate evaluation accepts Uncertain. The two policies should be reconciled and documented.
Recommendation: Decide one quality policy for "is this input usable" and apply it in both AreInputsReady and the message resolver, or explicitly document why predicate evaluation and message rendering treat Uncertain differently. Add the predicate-side Uncertain rule to docs/ScriptedAlarms.md.
Resolution: (open)
Core.ScriptedAlarms-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | Part9StateMachine.cs:275 |
| Status | Open |
Description: TransitionResult.NoOp(state, reason) takes a reason string parameter that is documented in the calling code as a diagnostic ("disabled — predicate result ignored", "already acknowledged", etc.) but the factory method silently discards it — it just returns new(state, EmissionKind.None), identical to None(state). Every call site that passes a carefully-worded reason string is doing dead work, and the comments in Part9StateMachine and the class-level remarks claim disabled/no-op transitions "produce ... a diagnostic log line", which they do not.
Recommendation: Either propagate the reason (add it to TransitionResult and have the engine log it at debug level when emission is None for a no-op), or remove the unused reason parameter and collapse NoOp into None. Update the Part9StateMachine remarks that promise a diagnostic log line.
Resolution: (open)
Core.ScriptedAlarms-012
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ScriptedAlarmEngineTests.cs |
| Status | Resolved |
Description: Several engine behaviours central to the module have no test coverage: (1) the 5-second shelving timer / timed-shelve auto-expiry through the engine — only the pure Part9StateMachine.ApplyShelvingCheck is tested, never ScriptedAlarmEngine driving the timer with an injectable clock; (2) ConfirmAsync, TimedShelveAsync, UnshelveAsync, EnableAsync engine methods (only Acknowledge, OneShotShelve, Disable, AddComment are exercised); (3) OnEvent subscriber-throws isolation (EmitEvent catch on line 357); (4) IAlarmStateStore.SaveAsync failure handling (finding 007); (5) re-entrant LoadAsync and the timer leak (finding 002); (6) the cold-start AreInputsReady guard with Bad / null / Uncertain inputs. The clock and scriptTimeout constructor parameters exist specifically to make timer/timeout tests deterministic but no test uses them.
Recommendation: Add engine-level tests that inject a controllable Func<DateTime> clock to drive RunShelvingCheck, cover the remaining Part 9 engine methods end-to-end, assert subscriber-exception isolation, and add a store-failure fake to lock in the chosen persistence-failure semantics from finding 007.
Resolution: Resolved 2026-05-22 — added 8 new engine-level tests covering all 6 gap areas: injectable-clock timed-shelve expiry via RunShelvingCheckForTest, ConfirmAsync/TimedShelveAsync/UnshelveAsync/EnableAsync end-to-end, subscriber-exception isolation, store-failure invariant, second-LoadAsync timer-leak regression, and AreInputsReady Bad/Uncertain guard; exposed RunShelvingCheckForTest() internal hook on the engine.