lmxopcua/code-reviews/Core.ScriptedAlarms/findings.md

# Code Review — Core.ScriptedAlarms

| Field | Value |
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms` |
| Reviewer | Claude Code |
| Review date | 2026-05-23 |
| Commit reviewed | `a9be809` |
| Status | Reviewed |
| Open findings | 1 |

## Checklist coverage

A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.

### 2026-05-22 review (commit `76d35d1`)

| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core.ScriptedAlarms-002 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Core.ScriptedAlarms-001, Core.ScriptedAlarms-004, Core.ScriptedAlarms-005, Core.ScriptedAlarms-006 |
| 4 | Error handling & resilience | Core.ScriptedAlarms-007 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Core.ScriptedAlarms-008, Core.ScriptedAlarms-009 |
| 7 | Design-document adherence | Core.ScriptedAlarms-010 |
| 8 | Code organization & conventions | Core.ScriptedAlarms-011 |
| 9 | Testing coverage | Core.ScriptedAlarms-012 |
| 10 | Documentation & comments | Core.ScriptedAlarms-003 |

### 2026-05-23 re-review (commit `a9be809`)

Focused re-review of the Core.ScriptedAlarms-009 resolution (commit `0001cdd`) —
new `AlarmScratch` class, `_scratchByAlarmId` ConcurrentDictionary, `RefillReadCache`
helper, and internal test accessors. Only the changed/new code since `76d35d1` was
re-examined; existing closed findings stay as audit trail.

| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No issues found |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | No issues found |
| 4 | Error handling & resilience | No issues found |
| 5 | Security | No issues found |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | No issues found |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | No issues found |
| 10 | Documentation & comments | Core.ScriptedAlarms-013 |

## Findings

### Core.ScriptedAlarms-001

| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `ScriptedAlarmEngine.cs:175`, `ScriptedAlarmEngine.cs:178`, `ScriptedAlarmEngine.cs:73`, `ScriptedAlarmEngine.cs:368` |
| Status | Resolved |

**Description:** `_alarms` is a plain `Dictionary<string, AlarmState>` (line 42). Every mutation of it (`LoadAsync`, `ApplyAsync`, `ReevaluateAsync`, `ShelvingCheckAsync`) correctly happens under the `_evalGate` semaphore, but four read paths touch it with no synchronisation: `GetState` (line 175), `GetAllStates` (line 178-179), the `LoadedAlarmIds` property (line 73), and `RunShelvingCheck` (line 368, `_alarms.Keys.ToArray()`). `RunShelvingCheck` fires from a `Timer` thread-pool callback and can run concurrently with an `ApplyAsync`/`ReevaluateAsync` that is reassigning a dictionary entry. `Dictionary` is not safe for concurrent read while another thread writes — even a value reassignment can be observed mid-rehash and throw `InvalidOperationException` or return torn state. `GetState`/`GetAllStates` are documented as being used by the Admin UI status page, so these reads come from arbitrary request threads.

**Recommendation:** Either switch `_alarms` to `ConcurrentDictionary<string, AlarmState>` (entry reassignment via `_alarms[id] = ...` is already the only write shape, which a `ConcurrentDictionary` supports atomically), or acquire `_evalGate` in every reader. A `ConcurrentDictionary` is the lighter change and matches `_valueCache`, which is already concurrent.

**Resolution:** Resolved 2026-05-22 — switched `_alarms` to `ConcurrentDictionary<string, AlarmState>` so the four unguarded read paths are safe against concurrent under-gate entry reassignment; added a concurrency regression test.

### Core.ScriptedAlarms-002

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `ScriptedAlarmEngine.cs:162`, `ScriptedAlarmEngine.cs:90` |
| Status | Resolved |

**Description:** `LoadAsync` is written to be re-callable — it begins by calling `UnsubscribeFromUpstream()`, `_alarms.Clear()`, and `_alarmsReferencing.Clear()` (lines 90-92), which only makes sense if a reload is supported. But at line 162 it unconditionally assigns `_shelvingTimer = new Timer(...)` without disposing the timer created by a previous `LoadAsync` call. A second `LoadAsync` therefore leaks the old `Timer` and leaves two timers running concurrently against the same `_alarms`/`_evalGate`. The old timer's `RunShelvingCheck` keeps firing forever.

**Recommendation:** Dispose any existing `_shelvingTimer` before reassigning it, e.g. `_shelvingTimer?.Dispose();` immediately before line 162, inside the `_evalGate` critical section. If reload is genuinely not supported, instead guard `LoadAsync` against a second call and document it as one-shot.

**Resolution:** Resolved 2026-05-22 — added `_shelvingTimer?.Dispose()` before the timer reassignment in `LoadAsync` so a second load call does not leak the previous timer.

### Core.ScriptedAlarms-003

| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `ScriptedAlarmEngine.cs:343`, `docs/ScriptedAlarms.md:107` |
| Status | Resolved |

**Description:** `docs/ScriptedAlarms.md` (Composition step 3) and the `OnUpstreamChange` comment ("Fire-and-forget so driver-side dispatch isn't blocked", line 225-226) describe the `OnEvent` emission path as non-blocking / fire-and-forget. In the code, `EmitEvent` invokes `OnEvent?.Invoke(this, evt)` **synchronously while `_evalGate` is held** (called from `EvaluatePredicateToStateAsync` line 305 and `ApplyAsync` line 217, both inside the gate). A slow subscriber blocks the single evaluation gate for all alarms; a subscriber that re-enters the engine (e.g. calls `AcknowledgeAsync`) deadlocks because `_evalGate` is a non-reentrant `SemaphoreSlim(1,1)`. The behaviour is defensible (the historian sink is non-blocking, per the doc), but the comments/doc are misleading about where the work happens and the re-entrancy hazard is undocumented.

**Recommendation:** Either move `EmitEvent` outside the `_evalGate` critical section (collect emissions during the locked section and raise them after `Release()`), or document explicitly on `OnEvent` that handlers run under the engine lock, must be fast, and must never call back into the engine.

**Resolution:** Resolved 2026-05-23 — split `EmitEvent` into `BuildEmission` (called under the gate to capture a coherent value-cache snapshot for message-template resolution) and `FireEvent` (called after `_evalGate.Release()` so subscribers can re-enter the engine without deadlocking and a slow subscriber no longer blocks concurrent engine operations). Updated `ApplyAsync`, `ReevaluateAsync`, `ShelvingCheckAsync`, and `LoadAsync` (startup-recovery path) to collect emissions in a pending list and flush after the gate is released; added regression tests for both the re-entry path and a white-box gate-acquirable-from-subscriber check.

### Core.ScriptedAlarms-004

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `ScriptedAlarmEngine.cs:138-143`, `ScriptedAlarmEngine.cs:227-234` |
| Status | Resolved |

**Description:** During `LoadAsync`, `_upstream.SubscribeTag(path, OnUpstreamChange)` is called inside the `_evalGate` critical section (line 142). If an upstream implementation delivers an initial value synchronously from inside `SubscribeTag` (a common pattern, and the `ITagUpstreamSource` contract does not forbid it), the observer callback `OnUpstreamChange` runs on the calling thread, schedules `ReevaluateAsync`, which calls `_evalGate.WaitAsync`. That does not deadlock (the reevaluation task simply blocks until `LoadAsync` releases the gate), but it can cause a re-evaluation to run against a half-initialised `_alarms`/index, and the value written to `_valueCache` on line 141 may be immediately overwritten by the subscription's synchronous push with no defined ordering. The cold-start guard partly masks this, but the ordering between the seed read (line 141) and the subscription push is unspecified and may seed a stale value.

**Recommendation:** Subscribe to all upstream tags after the seed reads and after `_loaded = true`, or capture the subscription's first push into the cache and treat `SubscribeTag` as the single source of truth (drop the separate `ReadTag` seed). Document the expected `ITagUpstreamSource` delivery semantics (does `SubscribeTag` push an initial value?).

**Resolution:** Resolved 2026-05-22 — split the seed/subscribe loop: `ReadTag` seeds `_valueCache`, persisted-state restore runs, `_loaded = true` is set, then `SubscribeTag` is called; any synchronous initial push now arrives after `_alarms` is fully initialised and correctly queues behind the gate.

### Core.ScriptedAlarms-005

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `ScriptedAlarmEngine.cs:365-369`, `ScriptedAlarmEngine.cs:416-424` |
| Status | Resolved |

**Description:** `Dispose` sets `_disposed = true`, disposes `_shelvingTimer`, and clears `_alarms`. A `RunShelvingCheck` callback already in flight on a thread-pool thread can have passed its `if (_disposed) return;` check (line 367) before `Dispose` ran, then proceed into `ShelvingCheckAsync`, which awaits `_evalGate` and mutates `_alarms` — concurrently with `Dispose`'s `_alarms.Clear()` at line 422 (which runs outside `_evalGate`). `Timer.Dispose()` does not wait for the running callback to finish. The result is a possible `InvalidOperationException` from a dictionary mutated during enumeration, or a save of stale state to the store after dispose. The same applies to a `ReevaluateAsync` in flight from a late upstream push.

**Recommendation:** Use `Timer.Dispose(WaitHandle)` (or `DisposeAsync`) to wait for the callback to drain, and perform `_alarms.Clear()` under `_evalGate` (or simply drop the clear — the object is being discarded). Also have `ShelvingCheckAsync`/`ReevaluateAsync` re-check `_disposed` after acquiring the gate before mutating/saving.

**Resolution:** Resolved 2026-05-22 — added `_disposed` re-checks in `ReevaluateAsync` and `ShelvingCheckAsync` after acquiring `_evalGate` so late callbacks bail out cleanly; dropped the unsynchronised `_alarms.Clear()` from `Dispose` since the object is being discarded and the clear raced concurrent reads.

### Core.ScriptedAlarms-006

| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `ScriptedAlarmEngine.cs:232`, `ScriptedAlarmEngine.cs:369` |
| Status | Resolved |

**Description:** `OnUpstreamChange` and `RunShelvingCheck` both launch fire-and-forget tasks (`_ = ReevaluateAsync(...)`, `_ = ShelvingCheckAsync(...)`) with `CancellationToken.None`. There is no tracking of these in-flight tasks, so `Dispose` cannot await them and a server shutdown can race a still-running re-evaluation that writes to the (possibly disposed) store. Combined with finding 005, an upstream push arriving during shutdown produces an unobserved background task touching torn state.

**Recommendation:** Track outstanding background tasks (or use a single serialised worker / `Channel`), and link them to a `CancellationTokenSource` that `Dispose` cancels and drains. At minimum, await the in-flight work in `Dispose`.

**Resolution:** Resolved 2026-05-23 — added `_inFlight` HashSet + `TrackBackgroundTask(...)` helper to register every fire-and-forget `ReevaluateAsync`/`ShelvingCheckAsync` task, with a sync `ContinueWith` continuation that auto-removes on completion. `Dispose` snapshots the set under its own lock and `Task.WhenAll(...).GetAwaiter().GetResult()` drains them before returning; `OnUpstreamChange` also short-circuits when `_disposed` is set so no new work is queued during shutdown. Regression test exercises the slow-store path: Dispose blocks until the in-flight `SaveAsync` completes.

### Core.ScriptedAlarms-007

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `ScriptedAlarmEngine.cs:216`, `ScriptedAlarmEngine.cs:251`, `ScriptedAlarmEngine.cs:154`, `ScriptedAlarmEngine.cs:387` |
| Status | Resolved |

**Description:** Every state mutation calls `await _store.SaveAsync(...)` and relies on it succeeding. If the production SQL-backed `IAlarmStateStore` (Stream E) throws — transient SQL outage, deadlock, timeout — the exception propagates: in `ApplyAsync` it surfaces to the Part 9 method caller *after* the in-memory `_alarms` entry was already updated (line 215 runs before the save on line 216), leaving the in-memory state and the persisted state divergent; in `ReevaluateAsync`/`ShelvingCheckAsync` it is caught and logged, but again the in-memory `_alarms` entry was already advanced (lines 250/386) so the persisted store silently falls behind the live state. After a restart, startup recovery reloads the stale persisted state and operators can see a re-raised or re-ackable alarm. The docs claim "the store's view is always consistent with the in-memory state" (`docs/ScriptedAlarms.md` State persistence) — that invariant is not actually enforced.

**Recommendation:** Save before committing the in-memory update, or roll back the in-memory entry if `SaveAsync` fails, so the two never diverge. Classify transient store failures and retry, and surface a hard error/health-degraded signal if persistence is permanently failing rather than silently logging and continuing.

**Resolution:** Resolved 2026-05-22 — reordered `SaveAsync`/`_alarms[id]=` in `ApplyAsync`, `ReevaluateAsync`, and `ShelvingCheckAsync` so persistence happens before the in-memory update; a store failure now leaves both views at the prior state rather than diverging.

### Core.ScriptedAlarms-008

| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `Part9StateMachine.cs:261-268` |
| Status | Resolved |

**Description:** `AppendComment` copies the entire existing comment list into a new `List` on every audit-producing transition (ack, confirm, shelve, unshelve, enable, disable, add-comment, auto-unshelve). The `Comments` list is append-only and unbounded — for a long-lived alarm that is acknowledged/commented hundreds of times, every transition is an O(n) copy and the full history is also re-serialised to the store on every `SaveAsync`. Over a multi-month uptime this is a slowly growing per-transition cost.

**Recommendation:** Acceptable for now given audit requirements, but consider an immutable persistent list / `ImmutableList<AlarmComment>` to make append O(log n), or have the store persist comments incrementally (append-only audit table) rather than rewriting the whole collection each save. At minimum, note the unbounded-growth characteristic in the design doc.

**Resolution:** Resolved 2026-05-23 — switched `AlarmConditionState.Comments` from `IReadOnlyList<AlarmComment>` to `ImmutableList<AlarmComment>` and rewrote `AppendComment` as `existing.Add(...)` so each append is O(log n) instead of the prior O(n) copy. `ImmutableList<T>` still implements `IReadOnlyList<T>` so existing consumers compile unchanged; the persistence layer continues to store comments as JSON so wire-format is unaffected. Regression test asserts the runtime type is `ImmutableList<AlarmComment>`.

### Core.ScriptedAlarms-009

| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `ScriptedAlarmEngine.cs:309-315`, `ScriptedAlarmEngine.cs:271` |
| Status | Resolved |

**Description:** `BuildReadCache` allocates a fresh `Dictionary<string, DataValueSnapshot>` on every predicate evaluation, i.e. on every upstream tag change for every referencing alarm. On a busy line where many tags feeding many alarms change frequently, this is a steady stream of short-lived dictionary allocations on the hot path. `AlarmPredicateContext` is also newly constructed each evaluation (line 281).

**Recommendation:** Minor. If the evaluation path shows up in allocation profiling, the read cache could be a reused per-alarm buffer cleared between evaluations (evaluations are already serialised under `_evalGate`, so a single shared scratch dictionary is safe). Not worth doing speculatively — flag for the perf surface in `docs/v2/Galaxy.Performance.md` if alarm evaluation is ever soak-tested.

**Resolution:** Resolved 2026-05-23 — added a per-alarm reusable `AlarmScratch`
(read-cache `Dictionary` + `AlarmPredicateContext`) held in
`_scratchByAlarmId`, populated lazily on first evaluation and refilled in place
by the new `RefillReadCache(Dictionary, IReadOnlySet)` helper on every
subsequent re-eval. `BuildReadCache` is gone. The reuse is safe because every
evaluation runs under `_evalGate`; the context wraps the dictionary by
reference, so the predicate's `ctx.GetTag(path)` sees the freshly-refilled
values. `LoadAsync` clears `_scratchByAlarmId` alongside `_alarms` so a
config-publish drops the prior generation's scratch (Inputs / Logger may
change). Regression tests added in `ScriptedAlarmEngineTests`:
`Reevaluation_reuses_the_same_read_cache_dictionary`,
`Reevaluation_reuses_the_same_predicate_context`, and
`LoadAsync_drops_the_prior_generations_scratch`; internal test hooks
`TryGetScratchReadCacheForTest` / `TryGetScratchContextForTest` exposed via
the existing `InternalsVisibleTo`. `docs/v2/Galaxy.Performance.md` "Scripted-alarm
engine" section rewritten to document the new reuse contract. Suite now 66
green (was 63).

### Core.ScriptedAlarms-010

| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `ScriptedAlarmEngine.cs:325-336`, `AlarmPredicateContext.cs:33-40`, `MessageTemplate.cs:47` |
| Status | Resolved |

**Description:** Quality handling is inconsistent across the three places that inspect a `DataValueSnapshot.StatusCode`. `AreInputsReady` (engine, line 333) treats only outright Bad (bit 31) as not-ready, so an Uncertain-quality input is fed to the predicate. `MessageTemplate.Resolve` (line 47) rejects *any* non-zero status code — including Uncertain — and renders `{?}`. `AlarmPredicateContext.GetTag` returns `BadNodeIdUnknown` (`0x80340000`) for a missing path. The net effect: an Uncertain-quality tag is considered good enough to drive an alarm *activation* decision but not good enough to print in the alarm *message*. `docs/ScriptedAlarms.md` ("Fallback rules") only documents the message-template behaviour and does not mention that predicate evaluation accepts Uncertain. The two policies should be reconciled and documented.

**Recommendation:** Decide one quality policy for "is this input usable" and apply it in both `AreInputsReady` and the message resolver, or explicitly document why predicate evaluation and message rendering treat Uncertain differently. Add the predicate-side Uncertain rule to `docs/ScriptedAlarms.md`.

**Resolution:** Resolved 2026-05-23 — documented the deliberate asymmetry. Added an "Input-quality policy" section to `docs/ScriptedAlarms.md` (table contrasting `AreInputsReady`'s Bad-only rejection with `MessageTemplate.Resolve`'s Good-only acceptance, plus the rationale) and a cross-referencing remarks block on `MessageTemplate.Resolve`. The two policies are kept distinct on purpose: predicate evaluation accepts Uncertain because the value is still inspectable, while the operator-facing message must render `{?}` to make the qualifier visible. Regression test locks in both behaviours with a single Uncertain-quality input that activates the alarm and surfaces `{?}` in the emission message.

### Core.ScriptedAlarms-011

| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `Part9StateMachine.cs:275` |
| Status | Resolved |

**Description:** `TransitionResult.NoOp(state, reason)` takes a `reason` string parameter that is documented in the calling code as a diagnostic ("disabled — predicate result ignored", "already acknowledged", etc.) but the factory method silently discards it — it just returns `new(state, EmissionKind.None)`, identical to `None(state)`. Every call site that passes a carefully-worded reason string is doing dead work, and the comments in `Part9StateMachine` and the class-level remarks claim disabled/no-op transitions "produce ... a diagnostic log line", which they do not.

**Recommendation:** Either propagate the reason (add it to `TransitionResult` and have the engine log it at debug level when emission is `None` for a no-op), or remove the unused `reason` parameter and collapse `NoOp` into `None`. Update the `Part9StateMachine` remarks that promise a diagnostic log line.

**Resolution:** Resolved 2026-05-23 — added a nullable `NoOpReason` property to `TransitionResult` (defaulted on the primary constructor so existing positional `new TransitionResult(state, kind)` call sites remain valid) and propagated it from `TransitionResult.NoOp(state, reason)`. `ScriptedAlarmEngine.ApplyAsync` now logs the reason at debug level via the alarm's script logger when the transition is a no-op, fulfilling the class-level remarks. Two regression tests assert that `NoOp` carries the reason and `None` does not.

### Core.ScriptedAlarms-012

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ScriptedAlarmEngineTests.cs` |
| Status | Resolved |

**Description:** Several engine behaviours central to the module have no test coverage: (1) the 5-second shelving timer / timed-shelve auto-expiry through the *engine* — only the pure `Part9StateMachine.ApplyShelvingCheck` is tested, never `ScriptedAlarmEngine` driving the timer with an injectable clock; (2) `ConfirmAsync`, `TimedShelveAsync`, `UnshelveAsync`, `EnableAsync` engine methods (only `Acknowledge`, `OneShotShelve`, `Disable`, `AddComment` are exercised); (3) `OnEvent` subscriber-throws isolation (`EmitEvent` catch on line 357); (4) `IAlarmStateStore.SaveAsync` failure handling (finding 007); (5) re-entrant `LoadAsync` and the timer leak (finding 002); (6) the cold-start `AreInputsReady` guard with Bad / null / Uncertain inputs. The `clock` and `scriptTimeout` constructor parameters exist specifically to make timer/timeout tests deterministic but no test uses them.

**Recommendation:** Add engine-level tests that inject a controllable `Func<DateTime>` clock to drive `RunShelvingCheck`, cover the remaining Part 9 engine methods end-to-end, assert subscriber-exception isolation, and add a store-failure fake to lock in the chosen persistence-failure semantics from finding 007.

**Resolution:** Resolved 2026-05-22 — added 8 new engine-level tests covering all 6 gap areas: injectable-clock timed-shelve expiry via `RunShelvingCheckForTest`, `ConfirmAsync`/`TimedShelveAsync`/`UnshelveAsync`/`EnableAsync` end-to-end, subscriber-exception isolation, store-failure invariant, second-`LoadAsync` timer-leak regression, and `AreInputsReady` Bad/Uncertain guard; exposed `RunShelvingCheckForTest()` internal hook on the engine.

### Core.ScriptedAlarms-013

| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `ScriptedAlarmEngine.cs:66-81` |
| Status | Open |

**Description:** The new internal test accessors `TryGetScratchReadCacheForTest` and `TryGetScratchContextForTest` (introduced by the Core.ScriptedAlarms-009 resolution at `0001cdd`) return the *live* per-alarm scratch — the same `Dictionary<string, DataValueSnapshot>` instance the engine clears and refills in `RefillReadCache` under `_evalGate`, plus the `AlarmPredicateContext` that wraps it by reference. The XML docs describe the intended use ("assert the scratch is reused across evaluations (two reads return the same instance)") but do not explicitly warn that:

1. The returned `IReadOnlyDictionary` is the engine's mutable working set. Enumerating it from a test thread while the engine is mid-evaluation (e.g. during a `ReevaluateAsync` queued by `OnUpstreamChange`, or a `ShelvingCheckAsync` callback) is a concurrent-read-while-writer scenario against a plain `Dictionary` — undefined behaviour, can throw `InvalidOperationException` or return torn data.
2. Reference-equality comparisons (`ReferenceEquals(a, b)`) and single-key indexer reads (`dict["Temp"]`) on a quiesced engine are the only safe uses. The existing regression tests stay within those bounds, but a future test author has no in-code signal that broader reads are unsafe.

The engine itself is correct — `RefillReadCache` runs only under `_evalGate`, so the engine never tears its own state. The risk is purely on the test-side contract.

**Recommendation:** Add a `<remarks>` block to both `TryGetScratchReadCacheForTest` and `TryGetScratchContextForTest` stating that the returned references point at live engine state, that reads are only safe when the engine is known to be idle (no in-flight `ReevaluateAsync`/`ShelvingCheckAsync`/`LoadAsync`), and that the intended uses are reference-identity assertions plus single-key lookups against a quiesced engine — never enumeration. No code change required; the engine's correctness depends on `_evalGate`, which is already documented.