# Code Review — Core.ScriptedAlarms | Field | Value | |---|---| | Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms` | | Reviewer | Claude Code | | Review date | 2026-05-22 | | Commit reviewed | `76d35d1` | | Status | Reviewed | | Open findings | 6 | ## Checklist coverage A comprehensive review completes every category, recording "No issues found" where a category produced nothing rather than leaving it blank. | # | Category | Result | |---|---|---| | 1 | Correctness & logic bugs | Core.ScriptedAlarms-002 | | 2 | OtOpcUa conventions | No issues found | | 3 | Concurrency & thread safety | Core.ScriptedAlarms-001, Core.ScriptedAlarms-004, Core.ScriptedAlarms-005, Core.ScriptedAlarms-006 | | 4 | Error handling & resilience | Core.ScriptedAlarms-007 | | 5 | Security | No issues found | | 6 | Performance & resource management | Core.ScriptedAlarms-008, Core.ScriptedAlarms-009 | | 7 | Design-document adherence | Core.ScriptedAlarms-010 | | 8 | Code organization & conventions | Core.ScriptedAlarms-011 | | 9 | Testing coverage | Core.ScriptedAlarms-012 | | 10 | Documentation & comments | Core.ScriptedAlarms-003 | ## Findings ### Core.ScriptedAlarms-001 | Field | Value | |---|---| | Severity | High | | Category | Concurrency & thread safety | | Location | `ScriptedAlarmEngine.cs:175`, `ScriptedAlarmEngine.cs:178`, `ScriptedAlarmEngine.cs:73`, `ScriptedAlarmEngine.cs:368` | | Status | Resolved | **Description:** `_alarms` is a plain `Dictionary` (line 42). Every mutation of it (`LoadAsync`, `ApplyAsync`, `ReevaluateAsync`, `ShelvingCheckAsync`) correctly happens under the `_evalGate` semaphore, but four read paths touch it with no synchronisation: `GetState` (line 175), `GetAllStates` (line 178-179), the `LoadedAlarmIds` property (line 73), and `RunShelvingCheck` (line 368, `_alarms.Keys.ToArray()`). `RunShelvingCheck` fires from a `Timer` thread-pool callback and can run concurrently with an `ApplyAsync`/`ReevaluateAsync` that is reassigning a dictionary entry. `Dictionary` is not safe for concurrent read while another thread writes — even a value reassignment can be observed mid-rehash and throw `InvalidOperationException` or return torn state. `GetState`/`GetAllStates` are documented as being used by the Admin UI status page, so these reads come from arbitrary request threads. **Recommendation:** Either switch `_alarms` to `ConcurrentDictionary` (entry reassignment via `_alarms[id] = ...` is already the only write shape, which a `ConcurrentDictionary` supports atomically), or acquire `_evalGate` in every reader. A `ConcurrentDictionary` is the lighter change and matches `_valueCache`, which is already concurrent. **Resolution:** Resolved 2026-05-22 — switched `_alarms` to `ConcurrentDictionary` so the four unguarded read paths are safe against concurrent under-gate entry reassignment; added a concurrency regression test. ### Core.ScriptedAlarms-002 | Field | Value | |---|---| | Severity | Medium | | Category | Correctness & logic bugs | | Location | `ScriptedAlarmEngine.cs:162`, `ScriptedAlarmEngine.cs:90` | | Status | Resolved | **Description:** `LoadAsync` is written to be re-callable — it begins by calling `UnsubscribeFromUpstream()`, `_alarms.Clear()`, and `_alarmsReferencing.Clear()` (lines 90-92), which only makes sense if a reload is supported. But at line 162 it unconditionally assigns `_shelvingTimer = new Timer(...)` without disposing the timer created by a previous `LoadAsync` call. A second `LoadAsync` therefore leaks the old `Timer` and leaves two timers running concurrently against the same `_alarms`/`_evalGate`. The old timer's `RunShelvingCheck` keeps firing forever. **Recommendation:** Dispose any existing `_shelvingTimer` before reassigning it, e.g. `_shelvingTimer?.Dispose();` immediately before line 162, inside the `_evalGate` critical section. If reload is genuinely not supported, instead guard `LoadAsync` against a second call and document it as one-shot. **Resolution:** Resolved 2026-05-22 — added `_shelvingTimer?.Dispose()` before the timer reassignment in `LoadAsync` so a second load call does not leak the previous timer. ### Core.ScriptedAlarms-003 | Field | Value | |---|---| | Severity | Low | | Category | Documentation & comments | | Location | `ScriptedAlarmEngine.cs:343`, `docs/ScriptedAlarms.md:107` | | Status | Open | **Description:** `docs/ScriptedAlarms.md` (Composition step 3) and the `OnUpstreamChange` comment ("Fire-and-forget so driver-side dispatch isn't blocked", line 225-226) describe the `OnEvent` emission path as non-blocking / fire-and-forget. In the code, `EmitEvent` invokes `OnEvent?.Invoke(this, evt)` **synchronously while `_evalGate` is held** (called from `EvaluatePredicateToStateAsync` line 305 and `ApplyAsync` line 217, both inside the gate). A slow subscriber blocks the single evaluation gate for all alarms; a subscriber that re-enters the engine (e.g. calls `AcknowledgeAsync`) deadlocks because `_evalGate` is a non-reentrant `SemaphoreSlim(1,1)`. The behaviour is defensible (the historian sink is non-blocking, per the doc), but the comments/doc are misleading about where the work happens and the re-entrancy hazard is undocumented. **Recommendation:** Either move `EmitEvent` outside the `_evalGate` critical section (collect emissions during the locked section and raise them after `Release()`), or document explicitly on `OnEvent` that handlers run under the engine lock, must be fast, and must never call back into the engine. **Resolution:** _(open)_ ### Core.ScriptedAlarms-004 | Field | Value | |---|---| | Severity | Medium | | Category | Concurrency & thread safety | | Location | `ScriptedAlarmEngine.cs:138-143`, `ScriptedAlarmEngine.cs:227-234` | | Status | Resolved | **Description:** During `LoadAsync`, `_upstream.SubscribeTag(path, OnUpstreamChange)` is called inside the `_evalGate` critical section (line 142). If an upstream implementation delivers an initial value synchronously from inside `SubscribeTag` (a common pattern, and the `ITagUpstreamSource` contract does not forbid it), the observer callback `OnUpstreamChange` runs on the calling thread, schedules `ReevaluateAsync`, which calls `_evalGate.WaitAsync`. That does not deadlock (the reevaluation task simply blocks until `LoadAsync` releases the gate), but it can cause a re-evaluation to run against a half-initialised `_alarms`/index, and the value written to `_valueCache` on line 141 may be immediately overwritten by the subscription's synchronous push with no defined ordering. The cold-start guard partly masks this, but the ordering between the seed read (line 141) and the subscription push is unspecified and may seed a stale value. **Recommendation:** Subscribe to all upstream tags after the seed reads and after `_loaded = true`, or capture the subscription's first push into the cache and treat `SubscribeTag` as the single source of truth (drop the separate `ReadTag` seed). Document the expected `ITagUpstreamSource` delivery semantics (does `SubscribeTag` push an initial value?). **Resolution:** Resolved 2026-05-22 — split the seed/subscribe loop: `ReadTag` seeds `_valueCache`, persisted-state restore runs, `_loaded = true` is set, then `SubscribeTag` is called; any synchronous initial push now arrives after `_alarms` is fully initialised and correctly queues behind the gate. ### Core.ScriptedAlarms-005 | Field | Value | |---|---| | Severity | Medium | | Category | Concurrency & thread safety | | Location | `ScriptedAlarmEngine.cs:365-369`, `ScriptedAlarmEngine.cs:416-424` | | Status | Resolved | **Description:** `Dispose` sets `_disposed = true`, disposes `_shelvingTimer`, and clears `_alarms`. A `RunShelvingCheck` callback already in flight on a thread-pool thread can have passed its `if (_disposed) return;` check (line 367) before `Dispose` ran, then proceed into `ShelvingCheckAsync`, which awaits `_evalGate` and mutates `_alarms` — concurrently with `Dispose`'s `_alarms.Clear()` at line 422 (which runs outside `_evalGate`). `Timer.Dispose()` does not wait for the running callback to finish. The result is a possible `InvalidOperationException` from a dictionary mutated during enumeration, or a save of stale state to the store after dispose. The same applies to a `ReevaluateAsync` in flight from a late upstream push. **Recommendation:** Use `Timer.Dispose(WaitHandle)` (or `DisposeAsync`) to wait for the callback to drain, and perform `_alarms.Clear()` under `_evalGate` (or simply drop the clear — the object is being discarded). Also have `ShelvingCheckAsync`/`ReevaluateAsync` re-check `_disposed` after acquiring the gate before mutating/saving. **Resolution:** Resolved 2026-05-22 — added `_disposed` re-checks in `ReevaluateAsync` and `ShelvingCheckAsync` after acquiring `_evalGate` so late callbacks bail out cleanly; dropped the unsynchronised `_alarms.Clear()` from `Dispose` since the object is being discarded and the clear raced concurrent reads. ### Core.ScriptedAlarms-006 | Field | Value | |---|---| | Severity | Low | | Category | Concurrency & thread safety | | Location | `ScriptedAlarmEngine.cs:232`, `ScriptedAlarmEngine.cs:369` | | Status | Open | **Description:** `OnUpstreamChange` and `RunShelvingCheck` both launch fire-and-forget tasks (`_ = ReevaluateAsync(...)`, `_ = ShelvingCheckAsync(...)`) with `CancellationToken.None`. There is no tracking of these in-flight tasks, so `Dispose` cannot await them and a server shutdown can race a still-running re-evaluation that writes to the (possibly disposed) store. Combined with finding 005, an upstream push arriving during shutdown produces an unobserved background task touching torn state. **Recommendation:** Track outstanding background tasks (or use a single serialised worker / `Channel`), and link them to a `CancellationTokenSource` that `Dispose` cancels and drains. At minimum, await the in-flight work in `Dispose`. **Resolution:** _(open)_ ### Core.ScriptedAlarms-007 | Field | Value | |---|---| | Severity | Medium | | Category | Error handling & resilience | | Location | `ScriptedAlarmEngine.cs:216`, `ScriptedAlarmEngine.cs:251`, `ScriptedAlarmEngine.cs:154`, `ScriptedAlarmEngine.cs:387` | | Status | Resolved | **Description:** Every state mutation calls `await _store.SaveAsync(...)` and relies on it succeeding. If the production SQL-backed `IAlarmStateStore` (Stream E) throws — transient SQL outage, deadlock, timeout — the exception propagates: in `ApplyAsync` it surfaces to the Part 9 method caller *after* the in-memory `_alarms` entry was already updated (line 215 runs before the save on line 216), leaving the in-memory state and the persisted state divergent; in `ReevaluateAsync`/`ShelvingCheckAsync` it is caught and logged, but again the in-memory `_alarms` entry was already advanced (lines 250/386) so the persisted store silently falls behind the live state. After a restart, startup recovery reloads the stale persisted state and operators can see a re-raised or re-ackable alarm. The docs claim "the store's view is always consistent with the in-memory state" (`docs/ScriptedAlarms.md` State persistence) — that invariant is not actually enforced. **Recommendation:** Save before committing the in-memory update, or roll back the in-memory entry if `SaveAsync` fails, so the two never diverge. Classify transient store failures and retry, and surface a hard error/health-degraded signal if persistence is permanently failing rather than silently logging and continuing. **Resolution:** Resolved 2026-05-22 — reordered `SaveAsync`/`_alarms[id]=` in `ApplyAsync`, `ReevaluateAsync`, and `ShelvingCheckAsync` so persistence happens before the in-memory update; a store failure now leaves both views at the prior state rather than diverging. ### Core.ScriptedAlarms-008 | Field | Value | |---|---| | Severity | Low | | Category | Performance & resource management | | Location | `Part9StateMachine.cs:261-268` | | Status | Open | **Description:** `AppendComment` copies the entire existing comment list into a new `List` on every audit-producing transition (ack, confirm, shelve, unshelve, enable, disable, add-comment, auto-unshelve). The `Comments` list is append-only and unbounded — for a long-lived alarm that is acknowledged/commented hundreds of times, every transition is an O(n) copy and the full history is also re-serialised to the store on every `SaveAsync`. Over a multi-month uptime this is a slowly growing per-transition cost. **Recommendation:** Acceptable for now given audit requirements, but consider an immutable persistent list / `ImmutableList` to make append O(log n), or have the store persist comments incrementally (append-only audit table) rather than rewriting the whole collection each save. At minimum, note the unbounded-growth characteristic in the design doc. **Resolution:** _(open)_ ### Core.ScriptedAlarms-009 | Field | Value | |---|---| | Severity | Low | | Category | Performance & resource management | | Location | `ScriptedAlarmEngine.cs:309-315`, `ScriptedAlarmEngine.cs:271` | | Status | Open | **Description:** `BuildReadCache` allocates a fresh `Dictionary` on every predicate evaluation, i.e. on every upstream tag change for every referencing alarm. On a busy line where many tags feeding many alarms change frequently, this is a steady stream of short-lived dictionary allocations on the hot path. `AlarmPredicateContext` is also newly constructed each evaluation (line 281). **Recommendation:** Minor. If the evaluation path shows up in allocation profiling, the read cache could be a reused per-alarm buffer cleared between evaluations (evaluations are already serialised under `_evalGate`, so a single shared scratch dictionary is safe). Not worth doing speculatively — flag for the perf surface in `docs/v2/Galaxy.Performance.md` if alarm evaluation is ever soak-tested. **Resolution:** _(open)_ ### Core.ScriptedAlarms-010 | Field | Value | |---|---| | Severity | Low | | Category | Design-document adherence | | Location | `ScriptedAlarmEngine.cs:325-336`, `AlarmPredicateContext.cs:33-40`, `MessageTemplate.cs:47` | | Status | Open | **Description:** Quality handling is inconsistent across the three places that inspect a `DataValueSnapshot.StatusCode`. `AreInputsReady` (engine, line 333) treats only outright Bad (bit 31) as not-ready, so an Uncertain-quality input is fed to the predicate. `MessageTemplate.Resolve` (line 47) rejects *any* non-zero status code — including Uncertain — and renders `{?}`. `AlarmPredicateContext.GetTag` returns `BadNodeIdUnknown` (`0x80340000`) for a missing path. The net effect: an Uncertain-quality tag is considered good enough to drive an alarm *activation* decision but not good enough to print in the alarm *message*. `docs/ScriptedAlarms.md` ("Fallback rules") only documents the message-template behaviour and does not mention that predicate evaluation accepts Uncertain. The two policies should be reconciled and documented. **Recommendation:** Decide one quality policy for "is this input usable" and apply it in both `AreInputsReady` and the message resolver, or explicitly document why predicate evaluation and message rendering treat Uncertain differently. Add the predicate-side Uncertain rule to `docs/ScriptedAlarms.md`. **Resolution:** _(open)_ ### Core.ScriptedAlarms-011 | Field | Value | |---|---| | Severity | Low | | Category | Code organization & conventions | | Location | `Part9StateMachine.cs:275` | | Status | Open | **Description:** `TransitionResult.NoOp(state, reason)` takes a `reason` string parameter that is documented in the calling code as a diagnostic ("disabled — predicate result ignored", "already acknowledged", etc.) but the factory method silently discards it — it just returns `new(state, EmissionKind.None)`, identical to `None(state)`. Every call site that passes a carefully-worded reason string is doing dead work, and the comments in `Part9StateMachine` and the class-level remarks claim disabled/no-op transitions "produce ... a diagnostic log line", which they do not. **Recommendation:** Either propagate the reason (add it to `TransitionResult` and have the engine log it at debug level when emission is `None` for a no-op), or remove the unused `reason` parameter and collapse `NoOp` into `None`. Update the `Part9StateMachine` remarks that promise a diagnostic log line. **Resolution:** _(open)_ ### Core.ScriptedAlarms-012 | Field | Value | |---|---| | Severity | Medium | | Category | Testing coverage | | Location | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ScriptedAlarmEngineTests.cs` | | Status | Resolved | **Description:** Several engine behaviours central to the module have no test coverage: (1) the 5-second shelving timer / timed-shelve auto-expiry through the *engine* — only the pure `Part9StateMachine.ApplyShelvingCheck` is tested, never `ScriptedAlarmEngine` driving the timer with an injectable clock; (2) `ConfirmAsync`, `TimedShelveAsync`, `UnshelveAsync`, `EnableAsync` engine methods (only `Acknowledge`, `OneShotShelve`, `Disable`, `AddComment` are exercised); (3) `OnEvent` subscriber-throws isolation (`EmitEvent` catch on line 357); (4) `IAlarmStateStore.SaveAsync` failure handling (finding 007); (5) re-entrant `LoadAsync` and the timer leak (finding 002); (6) the cold-start `AreInputsReady` guard with Bad / null / Uncertain inputs. The `clock` and `scriptTimeout` constructor parameters exist specifically to make timer/timeout tests deterministic but no test uses them. **Recommendation:** Add engine-level tests that inject a controllable `Func` clock to drive `RunShelvingCheck`, cover the remaining Part 9 engine methods end-to-end, assert subscriber-exception isolation, and add a store-failure fake to lock in the chosen persistence-failure semantics from finding 007. **Resolution:** Resolved 2026-05-22 — added 8 new engine-level tests covering all 6 gap areas: injectable-clock timed-shelve expiry via `RunShelvingCheckForTest`, `ConfirmAsync`/`TimedShelveAsync`/`UnshelveAsync`/`EnableAsync` end-to-end, subscriber-exception isolation, store-failure invariant, second-`LoadAsync` timer-leak regression, and `AreInputsReady` Bad/Uncertain guard; exposed `RunShelvingCheckForTest()` internal hook on the engine.