lmxopcua/code-reviews/Core.ScriptedAlarms/findings.md

# Code Review — Core.ScriptedAlarms

| Field | Value |
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 0 |

## Checklist coverage

A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.

| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core.ScriptedAlarms-002 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Core.ScriptedAlarms-001, Core.ScriptedAlarms-004, Core.ScriptedAlarms-005, Core.ScriptedAlarms-006 |
| 4 | Error handling & resilience | Core.ScriptedAlarms-007 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Core.ScriptedAlarms-008, Core.ScriptedAlarms-009 |
| 7 | Design-document adherence | Core.ScriptedAlarms-010 |
| 8 | Code organization & conventions | Core.ScriptedAlarms-011 |
| 9 | Testing coverage | Core.ScriptedAlarms-012 |
| 10 | Documentation & comments | Core.ScriptedAlarms-003 |

## Findings

### Core.ScriptedAlarms-001

| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `ScriptedAlarmEngine.cs:175`, `ScriptedAlarmEngine.cs:178`, `ScriptedAlarmEngine.cs:73`, `ScriptedAlarmEngine.cs:368` |
| Status | Resolved |

**Description:** `_alarms` is a plain `Dictionary<string, AlarmState>` (line 42). Every mutation of it (`LoadAsync`, `ApplyAsync`, `ReevaluateAsync`, `ShelvingCheckAsync`) correctly happens under the `_evalGate` semaphore, but four read paths touch it with no synchronisation: `GetState` (line 175), `GetAllStates` (line 178-179), the `LoadedAlarmIds` property (line 73), and `RunShelvingCheck` (line 368, `_alarms.Keys.ToArray()`). `RunShelvingCheck` fires from a `Timer` thread-pool callback and can run concurrently with an `ApplyAsync`/`ReevaluateAsync` that is reassigning a dictionary entry. `Dictionary` is not safe for concurrent read while another thread writes — even a value reassignment can be observed mid-rehash and throw `InvalidOperationException` or return torn state. `GetState`/`GetAllStates` are documented as being used by the Admin UI status page, so these reads come from arbitrary request threads.

**Recommendation:** Either switch `_alarms` to `ConcurrentDictionary<string, AlarmState>` (entry reassignment via `_alarms[id] = ...` is already the only write shape, which a `ConcurrentDictionary` supports atomically), or acquire `_evalGate` in every reader. A `ConcurrentDictionary` is the lighter change and matches `_valueCache`, which is already concurrent.

**Resolution:** Resolved 2026-05-22 — switched `_alarms` to `ConcurrentDictionary<string, AlarmState>` so the four unguarded read paths are safe against concurrent under-gate entry reassignment; added a concurrency regression test.

### Core.ScriptedAlarms-002

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `ScriptedAlarmEngine.cs:162`, `ScriptedAlarmEngine.cs:90` |
| Status | Resolved |

**Description:** `LoadAsync` is written to be re-callable — it begins by calling `UnsubscribeFromUpstream()`, `_alarms.Clear()`, and `_alarmsReferencing.Clear()` (lines 90-92), which only makes sense if a reload is supported. But at line 162 it unconditionally assigns `_shelvingTimer = new Timer(...)` without disposing the timer created by a previous `LoadAsync` call. A second `LoadAsync` therefore leaks the old `Timer` and leaves two timers running concurrently against the same `_alarms`/`_evalGate`. The old timer's `RunShelvingCheck` keeps firing forever.

**Recommendation:** Dispose any existing `_shelvingTimer` before reassigning it, e.g. `_shelvingTimer?.Dispose();` immediately before line 162, inside the `_evalGate` critical section. If reload is genuinely not supported, instead guard `LoadAsync` against a second call and document it as one-shot.

**Resolution:** Resolved 2026-05-22 — added `_shelvingTimer?.Dispose()` before the timer reassignment in `LoadAsync` so a second load call does not leak the previous timer.

### Core.ScriptedAlarms-003

| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `ScriptedAlarmEngine.cs:343`, `docs/ScriptedAlarms.md:107` |
| Status | Resolved |

**Description:** `docs/ScriptedAlarms.md` (Composition step 3) and the `OnUpstreamChange` comment ("Fire-and-forget so driver-side dispatch isn't blocked", line 225-226) describe the `OnEvent` emission path as non-blocking / fire-and-forget. In the code, `EmitEvent` invokes `OnEvent?.Invoke(this, evt)` **synchronously while `_evalGate` is held** (called from `EvaluatePredicateToStateAsync` line 305 and `ApplyAsync` line 217, both inside the gate). A slow subscriber blocks the single evaluation gate for all alarms; a subscriber that re-enters the engine (e.g. calls `AcknowledgeAsync`) deadlocks because `_evalGate` is a non-reentrant `SemaphoreSlim(1,1)`. The behaviour is defensible (the historian sink is non-blocking, per the doc), but the comments/doc are misleading about where the work happens and the re-entrancy hazard is undocumented.

**Recommendation:** Either move `EmitEvent` outside the `_evalGate` critical section (collect emissions during the locked section and raise them after `Release()`), or document explicitly on `OnEvent` that handlers run under the engine lock, must be fast, and must never call back into the engine.

**Resolution:** Resolved 2026-05-23 — split `EmitEvent` into `BuildEmission` (called under the gate to capture a coherent value-cache snapshot for message-template resolution) and `FireEvent` (called after `_evalGate.Release()` so subscribers can re-enter the engine without deadlocking and a slow subscriber no longer blocks concurrent engine operations). Updated `ApplyAsync`, `ReevaluateAsync`, `ShelvingCheckAsync`, and `LoadAsync` (startup-recovery path) to collect emissions in a pending list and flush after the gate is released; added regression tests for both the re-entry path and a white-box gate-acquirable-from-subscriber check.

### Core.ScriptedAlarms-004

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `ScriptedAlarmEngine.cs:138-143`, `ScriptedAlarmEngine.cs:227-234` |
| Status | Resolved |

**Description:** During `LoadAsync`, `_upstream.SubscribeTag(path, OnUpstreamChange)` is called inside the `_evalGate` critical section (line 142). If an upstream implementation delivers an initial value synchronously from inside `SubscribeTag` (a common pattern, and the `ITagUpstreamSource` contract does not forbid it), the observer callback `OnUpstreamChange` runs on the calling thread, schedules `ReevaluateAsync`, which calls `_evalGate.WaitAsync`. That does not deadlock (the reevaluation task simply blocks until `LoadAsync` releases the gate), but it can cause a re-evaluation to run against a half-initialised `_alarms`/index, and the value written to `_valueCache` on line 141 may be immediately overwritten by the subscription's synchronous push with no defined ordering. The cold-start guard partly masks this, but the ordering between the seed read (line 141) and the subscription push is unspecified and may seed a stale value.

**Recommendation:** Subscribe to all upstream tags after the seed reads and after `_loaded = true`, or capture the subscription's first push into the cache and treat `SubscribeTag` as the single source of truth (drop the separate `ReadTag` seed). Document the expected `ITagUpstreamSource` delivery semantics (does `SubscribeTag` push an initial value?).

**Resolution:** Resolved 2026-05-22 — split the seed/subscribe loop: `ReadTag` seeds `_valueCache`, persisted-state restore runs, `_loaded = true` is set, then `SubscribeTag` is called; any synchronous initial push now arrives after `_alarms` is fully initialised and correctly queues behind the gate.

### Core.ScriptedAlarms-005

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `ScriptedAlarmEngine.cs:365-369`, `ScriptedAlarmEngine.cs:416-424` |
| Status | Resolved |

**Description:** `Dispose` sets `_disposed = true`, disposes `_shelvingTimer`, and clears `_alarms`. A `RunShelvingCheck` callback already in flight on a thread-pool thread can have passed its `if (_disposed) return;` check (line 367) before `Dispose` ran, then proceed into `ShelvingCheckAsync`, which awaits `_evalGate` and mutates `_alarms` — concurrently with `Dispose`'s `_alarms.Clear()` at line 422 (which runs outside `_evalGate`). `Timer.Dispose()` does not wait for the running callback to finish. The result is a possible `InvalidOperationException` from a dictionary mutated during enumeration, or a save of stale state to the store after dispose. The same applies to a `ReevaluateAsync` in flight from a late upstream push.

**Recommendation:** Use `Timer.Dispose(WaitHandle)` (or `DisposeAsync`) to wait for the callback to drain, and perform `_alarms.Clear()` under `_evalGate` (or simply drop the clear — the object is being discarded). Also have `ShelvingCheckAsync`/`ReevaluateAsync` re-check `_disposed` after acquiring the gate before mutating/saving.

**Resolution:** Resolved 2026-05-22 — added `_disposed` re-checks in `ReevaluateAsync` and `ShelvingCheckAsync` after acquiring `_evalGate` so late callbacks bail out cleanly; dropped the unsynchronised `_alarms.Clear()` from `Dispose` since the object is being discarded and the clear raced concurrent reads.

### Core.ScriptedAlarms-006

| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `ScriptedAlarmEngine.cs:232`, `ScriptedAlarmEngine.cs:369` |
| Status | Resolved |

**Description:** `OnUpstreamChange` and `RunShelvingCheck` both launch fire-and-forget tasks (`_ = ReevaluateAsync(...)`, `_ = ShelvingCheckAsync(...)`) with `CancellationToken.None`. There is no tracking of these in-flight tasks, so `Dispose` cannot await them and a server shutdown can race a still-running re-evaluation that writes to the (possibly disposed) store. Combined with finding 005, an upstream push arriving during shutdown produces an unobserved background task touching torn state.

**Recommendation:** Track outstanding background tasks (or use a single serialised worker / `Channel`), and link them to a `CancellationTokenSource` that `Dispose` cancels and drains. At minimum, await the in-flight work in `Dispose`.

**Resolution:** Resolved 2026-05-23 — added `_inFlight` HashSet + `TrackBackgroundTask(...)` helper to register every fire-and-forget `ReevaluateAsync`/`ShelvingCheckAsync` task, with a sync `ContinueWith` continuation that auto-removes on completion. `Dispose` snapshots the set under its own lock and `Task.WhenAll(...).GetAwaiter().GetResult()` drains them before returning; `OnUpstreamChange` also short-circuits when `_disposed` is set so no new work is queued during shutdown. Regression test exercises the slow-store path: Dispose blocks until the in-flight `SaveAsync` completes.

### Core.ScriptedAlarms-007

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `ScriptedAlarmEngine.cs:216`, `ScriptedAlarmEngine.cs:251`, `ScriptedAlarmEngine.cs:154`, `ScriptedAlarmEngine.cs:387` |
| Status | Resolved |

**Description:** Every state mutation calls `await _store.SaveAsync(...)` and relies on it succeeding. If the production SQL-backed `IAlarmStateStore` (Stream E) throws — transient SQL outage, deadlock, timeout — the exception propagates: in `ApplyAsync` it surfaces to the Part 9 method caller *after* the in-memory `_alarms` entry was already updated (line 215 runs before the save on line 216), leaving the in-memory state and the persisted state divergent; in `ReevaluateAsync`/`ShelvingCheckAsync` it is caught and logged, but again the in-memory `_alarms` entry was already advanced (lines 250/386) so the persisted store silently falls behind the live state. After a restart, startup recovery reloads the stale persisted state and operators can see a re-raised or re-ackable alarm. The docs claim "the store's view is always consistent with the in-memory state" (`docs/ScriptedAlarms.md` State persistence) — that invariant is not actually enforced.

**Recommendation:** Save before committing the in-memory update, or roll back the in-memory entry if `SaveAsync` fails, so the two never diverge. Classify transient store failures and retry, and surface a hard error/health-degraded signal if persistence is permanently failing rather than silently logging and continuing.

**Resolution:** Resolved 2026-05-22 — reordered `SaveAsync`/`_alarms[id]=` in `ApplyAsync`, `ReevaluateAsync`, and `ShelvingCheckAsync` so persistence happens before the in-memory update; a store failure now leaves both views at the prior state rather than diverging.

### Core.ScriptedAlarms-008

| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `Part9StateMachine.cs:261-268` |
| Status | Resolved |

**Description:** `AppendComment` copies the entire existing comment list into a new `List` on every audit-producing transition (ack, confirm, shelve, unshelve, enable, disable, add-comment, auto-unshelve). The `Comments` list is append-only and unbounded — for a long-lived alarm that is acknowledged/commented hundreds of times, every transition is an O(n) copy and the full history is also re-serialised to the store on every `SaveAsync`. Over a multi-month uptime this is a slowly growing per-transition cost.

**Recommendation:** Acceptable for now given audit requirements, but consider an immutable persistent list / `ImmutableList<AlarmComment>` to make append O(log n), or have the store persist comments incrementally (append-only audit table) rather than rewriting the whole collection each save. At minimum, note the unbounded-growth characteristic in the design doc.

**Resolution:** Resolved 2026-05-23 — switched `AlarmConditionState.Comments` from `IReadOnlyList<AlarmComment>` to `ImmutableList<AlarmComment>` and rewrote `AppendComment` as `existing.Add(...)` so each append is O(log n) instead of the prior O(n) copy. `ImmutableList<T>` still implements `IReadOnlyList<T>` so existing consumers compile unchanged; the persistence layer continues to store comments as JSON so wire-format is unaffected. Regression test asserts the runtime type is `ImmutableList<AlarmComment>`.

### Core.ScriptedAlarms-009

| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `ScriptedAlarmEngine.cs:309-315`, `ScriptedAlarmEngine.cs:271` |
| Status | Resolved |

**Description:** `BuildReadCache` allocates a fresh `Dictionary<string, DataValueSnapshot>` on every predicate evaluation, i.e. on every upstream tag change for every referencing alarm. On a busy line where many tags feeding many alarms change frequently, this is a steady stream of short-lived dictionary allocations on the hot path. `AlarmPredicateContext` is also newly constructed each evaluation (line 281).

**Recommendation:** Minor. If the evaluation path shows up in allocation profiling, the read cache could be a reused per-alarm buffer cleared between evaluations (evaluations are already serialised under `_evalGate`, so a single shared scratch dictionary is safe). Not worth doing speculatively — flag for the perf surface in `docs/v2/Galaxy.Performance.md` if alarm evaluation is ever soak-tested.

**Resolution:** Resolved 2026-05-23 — added a per-alarm reusable `AlarmScratch`
(read-cache `Dictionary` + `AlarmPredicateContext`) held in
`_scratchByAlarmId`, populated lazily on first evaluation and refilled in place
by the new `RefillReadCache(Dictionary, IReadOnlySet)` helper on every
subsequent re-eval. `BuildReadCache` is gone. The reuse is safe because every
evaluation runs under `_evalGate`; the context wraps the dictionary by
reference, so the predicate's `ctx.GetTag(path)` sees the freshly-refilled
values. `LoadAsync` clears `_scratchByAlarmId` alongside `_alarms` so a
config-publish drops the prior generation's scratch (Inputs / Logger may
change). Regression tests added in `ScriptedAlarmEngineTests`:
`Reevaluation_reuses_the_same_read_cache_dictionary`,
`Reevaluation_reuses_the_same_predicate_context`, and
`LoadAsync_drops_the_prior_generations_scratch`; internal test hooks
`TryGetScratchReadCacheForTest` / `TryGetScratchContextForTest` exposed via
the existing `InternalsVisibleTo`. `docs/v2/Galaxy.Performance.md` "Scripted-alarm
engine" section rewritten to document the new reuse contract. Suite now 66
green (was 63).

### Core.ScriptedAlarms-010

| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `ScriptedAlarmEngine.cs:325-336`, `AlarmPredicateContext.cs:33-40`, `MessageTemplate.cs:47` |
| Status | Resolved |

**Description:** Quality handling is inconsistent across the three places that inspect a `DataValueSnapshot.StatusCode`. `AreInputsReady` (engine, line 333) treats only outright Bad (bit 31) as not-ready, so an Uncertain-quality input is fed to the predicate. `MessageTemplate.Resolve` (line 47) rejects *any* non-zero status code — including Uncertain — and renders `{?}`. `AlarmPredicateContext.GetTag` returns `BadNodeIdUnknown` (`0x80340000`) for a missing path. The net effect: an Uncertain-quality tag is considered good enough to drive an alarm *activation* decision but not good enough to print in the alarm *message*. `docs/ScriptedAlarms.md` ("Fallback rules") only documents the message-template behaviour and does not mention that predicate evaluation accepts Uncertain. The two policies should be reconciled and documented.

**Recommendation:** Decide one quality policy for "is this input usable" and apply it in both `AreInputsReady` and the message resolver, or explicitly document why predicate evaluation and message rendering treat Uncertain differently. Add the predicate-side Uncertain rule to `docs/ScriptedAlarms.md`.

**Resolution:** Resolved 2026-05-23 — documented the deliberate asymmetry. Added an "Input-quality policy" section to `docs/ScriptedAlarms.md` (table contrasting `AreInputsReady`'s Bad-only rejection with `MessageTemplate.Resolve`'s Good-only acceptance, plus the rationale) and a cross-referencing remarks block on `MessageTemplate.Resolve`. The two policies are kept distinct on purpose: predicate evaluation accepts Uncertain because the value is still inspectable, while the operator-facing message must render `{?}` to make the qualifier visible. Regression test locks in both behaviours with a single Uncertain-quality input that activates the alarm and surfaces `{?}` in the emission message.

### Core.ScriptedAlarms-011

| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `Part9StateMachine.cs:275` |
| Status | Resolved |

**Description:** `TransitionResult.NoOp(state, reason)` takes a `reason` string parameter that is documented in the calling code as a diagnostic ("disabled — predicate result ignored", "already acknowledged", etc.) but the factory method silently discards it — it just returns `new(state, EmissionKind.None)`, identical to `None(state)`. Every call site that passes a carefully-worded reason string is doing dead work, and the comments in `Part9StateMachine` and the class-level remarks claim disabled/no-op transitions "produce ... a diagnostic log line", which they do not.

**Recommendation:** Either propagate the reason (add it to `TransitionResult` and have the engine log it at debug level when emission is `None` for a no-op), or remove the unused `reason` parameter and collapse `NoOp` into `None`. Update the `Part9StateMachine` remarks that promise a diagnostic log line.

**Resolution:** Resolved 2026-05-23 — added a nullable `NoOpReason` property to `TransitionResult` (defaulted on the primary constructor so existing positional `new TransitionResult(state, kind)` call sites remain valid) and propagated it from `TransitionResult.NoOp(state, reason)`. `ScriptedAlarmEngine.ApplyAsync` now logs the reason at debug level via the alarm's script logger when the transition is a no-op, fulfilling the class-level remarks. Two regression tests assert that `NoOp` carries the reason and `None` does not.

### Core.ScriptedAlarms-012

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ScriptedAlarmEngineTests.cs` |
| Status | Resolved |

**Description:** Several engine behaviours central to the module have no test coverage: (1) the 5-second shelving timer / timed-shelve auto-expiry through the *engine* — only the pure `Part9StateMachine.ApplyShelvingCheck` is tested, never `ScriptedAlarmEngine` driving the timer with an injectable clock; (2) `ConfirmAsync`, `TimedShelveAsync`, `UnshelveAsync`, `EnableAsync` engine methods (only `Acknowledge`, `OneShotShelve`, `Disable`, `AddComment` are exercised); (3) `OnEvent` subscriber-throws isolation (`EmitEvent` catch on line 357); (4) `IAlarmStateStore.SaveAsync` failure handling (finding 007); (5) re-entrant `LoadAsync` and the timer leak (finding 002); (6) the cold-start `AreInputsReady` guard with Bad / null / Uncertain inputs. The `clock` and `scriptTimeout` constructor parameters exist specifically to make timer/timeout tests deterministic but no test uses them.

**Recommendation:** Add engine-level tests that inject a controllable `Func<DateTime>` clock to drive `RunShelvingCheck`, cover the remaining Part 9 engine methods end-to-end, assert subscriber-exception isolation, and add a store-failure fake to lock in the chosen persistence-failure semantics from finding 007.

**Resolution:** Resolved 2026-05-22 — added 8 new engine-level tests covering all 6 gap areas: injectable-clock timed-shelve expiry via `RunShelvingCheckForTest`, `ConfirmAsync`/`TimedShelveAsync`/`UnshelveAsync`/`EnableAsync` end-to-end, subscriber-exception isolation, store-failure invariant, second-`LoadAsync` timer-leak regression, and `AreInputsReady` Bad/Uncertain guard; exposed `RunShelvingCheckForTest()` internal hook on the engine.