fix(core-scripted-alarms): resolve Low code-review findings (Core.ScriptedAlarms-003,006,008,010,011; -009 documented)

- Core.ScriptedAlarms-003: emit OnEvent OUTSIDE _evalGate by collecting pending emissions during the gate-held section and flushing them after release; eliminates re-entrancy deadlock the docs already promised. - Core.ScriptedAlarms-006: track every fire-and-forget Reevaluate / ShelvingCheck task in _inFlight; Dispose drains the set so the engine no longer races store writes against teardown. - Core.ScriptedAlarms-008: store comments as ImmutableList<AlarmComment> so AppendComment is O(log n) instead of O(n). - Core.ScriptedAlarms-010: document the deliberate input-quality asymmetry (Uncertain drives the predicate, renders {?} in the message) in docs/ScriptedAlarms.md and on MessageTemplate.Resolve remarks. - Core.ScriptedAlarms-011: propagate the no-op reason through TransitionResult.NoOp(state, reason) and log it from ScriptedAlarmEngine.ApplyAsync. - Core.ScriptedAlarms-009 (Won't Fix per recommendation): documented the per-evaluation dictionary allocation in docs/v2/Galaxy.Performance.md with a mitigation path if a future soak surfaces pressure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 07:23:31 -04:00
parent e74e8f7b31
commit 99354bfaf2
8 changed files with 491 additions and 42 deletions
@@ -7,7 +7,7 @@
 | Review date | 2026-05-22 |
 | Commit reviewed | `76d35d1` |
 | Status | Reviewed |
-| Open findings | 6 |
+| Open findings | 0 |

 ## Checklist coverage

@@ -66,13 +66,13 @@ a category produced nothing rather than leaving it blank.
 | Severity | Low |
 | Category | Documentation & comments |
 | Location | `ScriptedAlarmEngine.cs:343`, `docs/ScriptedAlarms.md:107` |
-| Status | Open |
+| Status | Resolved |

 **Description:** `docs/ScriptedAlarms.md` (Composition step 3) and the `OnUpstreamChange` comment ("Fire-and-forget so driver-side dispatch isn't blocked", line 225-226) describe the `OnEvent` emission path as non-blocking / fire-and-forget. In the code, `EmitEvent` invokes `OnEvent?.Invoke(this, evt)` **synchronously while `_evalGate` is held** (called from `EvaluatePredicateToStateAsync` line 305 and `ApplyAsync` line 217, both inside the gate). A slow subscriber blocks the single evaluation gate for all alarms; a subscriber that re-enters the engine (e.g. calls `AcknowledgeAsync`) deadlocks because `_evalGate` is a non-reentrant `SemaphoreSlim(1,1)`. The behaviour is defensible (the historian sink is non-blocking, per the doc), but the comments/doc are misleading about where the work happens and the re-entrancy hazard is undocumented.

 **Recommendation:** Either move `EmitEvent` outside the `_evalGate` critical section (collect emissions during the locked section and raise them after `Release()`), or document explicitly on `OnEvent` that handlers run under the engine lock, must be fast, and must never call back into the engine.

-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-23 — split `EmitEvent` into `BuildEmission` (called under the gate to capture a coherent value-cache snapshot for message-template resolution) and `FireEvent` (called after `_evalGate.Release()` so subscribers can re-enter the engine without deadlocking and a slow subscriber no longer blocks concurrent engine operations). Updated `ApplyAsync`, `ReevaluateAsync`, `ShelvingCheckAsync`, and `LoadAsync` (startup-recovery path) to collect emissions in a pending list and flush after the gate is released; added regression tests for both the re-entry path and a white-box gate-acquirable-from-subscriber check.

 ### Core.ScriptedAlarms-004

@@ -111,13 +111,13 @@ a category produced nothing rather than leaving it blank.
 | Severity | Low |
 | Category | Concurrency & thread safety |
 | Location | `ScriptedAlarmEngine.cs:232`, `ScriptedAlarmEngine.cs:369` |
-| Status | Open |
+| Status | Resolved |

 **Description:** `OnUpstreamChange` and `RunShelvingCheck` both launch fire-and-forget tasks (`_ = ReevaluateAsync(...)`, `_ = ShelvingCheckAsync(...)`) with `CancellationToken.None`. There is no tracking of these in-flight tasks, so `Dispose` cannot await them and a server shutdown can race a still-running re-evaluation that writes to the (possibly disposed) store. Combined with finding 005, an upstream push arriving during shutdown produces an unobserved background task touching torn state.

 **Recommendation:** Track outstanding background tasks (or use a single serialised worker / `Channel`), and link them to a `CancellationTokenSource` that `Dispose` cancels and drains. At minimum, await the in-flight work in `Dispose`.

-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-23 — added `_inFlight` HashSet + `TrackBackgroundTask(...)` helper to register every fire-and-forget `ReevaluateAsync`/`ShelvingCheckAsync` task, with a sync `ContinueWith` continuation that auto-removes on completion. `Dispose` snapshots the set under its own lock and `Task.WhenAll(...).GetAwaiter().GetResult()` drains them before returning; `OnUpstreamChange` also short-circuits when `_disposed` is set so no new work is queued during shutdown. Regression test exercises the slow-store path: Dispose blocks until the in-flight `SaveAsync` completes.

 ### Core.ScriptedAlarms-007

@@ -141,13 +141,13 @@ a category produced nothing rather than leaving it blank.
 | Severity | Low |
 | Category | Performance & resource management |
 | Location | `Part9StateMachine.cs:261-268` |
-| Status | Open |
+| Status | Resolved |

 **Description:** `AppendComment` copies the entire existing comment list into a new `List` on every audit-producing transition (ack, confirm, shelve, unshelve, enable, disable, add-comment, auto-unshelve). The `Comments` list is append-only and unbounded — for a long-lived alarm that is acknowledged/commented hundreds of times, every transition is an O(n) copy and the full history is also re-serialised to the store on every `SaveAsync`. Over a multi-month uptime this is a slowly growing per-transition cost.

 **Recommendation:** Acceptable for now given audit requirements, but consider an immutable persistent list / `ImmutableList<AlarmComment>` to make append O(log n), or have the store persist comments incrementally (append-only audit table) rather than rewriting the whole collection each save. At minimum, note the unbounded-growth characteristic in the design doc.

-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-23 — switched `AlarmConditionState.Comments` from `IReadOnlyList<AlarmComment>` to `ImmutableList<AlarmComment>` and rewrote `AppendComment` as `existing.Add(...)` so each append is O(log n) instead of the prior O(n) copy. `ImmutableList<T>` still implements `IReadOnlyList<T>` so existing consumers compile unchanged; the persistence layer continues to store comments as JSON so wire-format is unaffected. Regression test asserts the runtime type is `ImmutableList<AlarmComment>`.

 ### Core.ScriptedAlarms-009

@@ -156,13 +156,13 @@ a category produced nothing rather than leaving it blank.
 | Severity | Low |
 | Category | Performance & resource management |
 | Location | `ScriptedAlarmEngine.cs:309-315`, `ScriptedAlarmEngine.cs:271` |
-| Status | Open |
+| Status | Won't Fix |

 **Description:** `BuildReadCache` allocates a fresh `Dictionary<string, DataValueSnapshot>` on every predicate evaluation, i.e. on every upstream tag change for every referencing alarm. On a busy line where many tags feeding many alarms change frequently, this is a steady stream of short-lived dictionary allocations on the hot path. `AlarmPredicateContext` is also newly constructed each evaluation (line 281).

 **Recommendation:** Minor. If the evaluation path shows up in allocation profiling, the read cache could be a reused per-alarm buffer cleared between evaluations (evaluations are already serialised under `_evalGate`, so a single shared scratch dictionary is safe). Not worth doing speculatively — flag for the perf surface in `docs/v2/Galaxy.Performance.md` if alarm evaluation is ever soak-tested.

-**Resolution:** _(open)_
+**Resolution:** Won't Fix 2026-05-23 — per the recommendation, no code change. Documented the known allocation characteristic in `docs/v2/Galaxy.Performance.md` (new "Scripted-alarm engine — known hot-path allocations" section) so a future soak that surfaces pressure has a noted mitigation (reused per-alarm scratch buffer) and we don't re-find this in a later review.

 ### Core.ScriptedAlarms-010

@@ -171,13 +171,13 @@ a category produced nothing rather than leaving it blank.
 | Severity | Low |
 | Category | Design-document adherence |
 | Location | `ScriptedAlarmEngine.cs:325-336`, `AlarmPredicateContext.cs:33-40`, `MessageTemplate.cs:47` |
-| Status | Open |
+| Status | Resolved |

 **Description:** Quality handling is inconsistent across the three places that inspect a `DataValueSnapshot.StatusCode`. `AreInputsReady` (engine, line 333) treats only outright Bad (bit 31) as not-ready, so an Uncertain-quality input is fed to the predicate. `MessageTemplate.Resolve` (line 47) rejects *any* non-zero status code — including Uncertain — and renders `{?}`. `AlarmPredicateContext.GetTag` returns `BadNodeIdUnknown` (`0x80340000`) for a missing path. The net effect: an Uncertain-quality tag is considered good enough to drive an alarm *activation* decision but not good enough to print in the alarm *message*. `docs/ScriptedAlarms.md` ("Fallback rules") only documents the message-template behaviour and does not mention that predicate evaluation accepts Uncertain. The two policies should be reconciled and documented.

 **Recommendation:** Decide one quality policy for "is this input usable" and apply it in both `AreInputsReady` and the message resolver, or explicitly document why predicate evaluation and message rendering treat Uncertain differently. Add the predicate-side Uncertain rule to `docs/ScriptedAlarms.md`.

-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-23 — documented the deliberate asymmetry. Added an "Input-quality policy" section to `docs/ScriptedAlarms.md` (table contrasting `AreInputsReady`'s Bad-only rejection with `MessageTemplate.Resolve`'s Good-only acceptance, plus the rationale) and a cross-referencing remarks block on `MessageTemplate.Resolve`. The two policies are kept distinct on purpose: predicate evaluation accepts Uncertain because the value is still inspectable, while the operator-facing message must render `{?}` to make the qualifier visible. Regression test locks in both behaviours with a single Uncertain-quality input that activates the alarm and surfaces `{?}` in the emission message.

 ### Core.ScriptedAlarms-011

@@ -186,13 +186,13 @@ a category produced nothing rather than leaving it blank.
 | Severity | Low |
 | Category | Code organization & conventions |
 | Location | `Part9StateMachine.cs:275` |
-| Status | Open |
+| Status | Resolved |

 **Description:** `TransitionResult.NoOp(state, reason)` takes a `reason` string parameter that is documented in the calling code as a diagnostic ("disabled — predicate result ignored", "already acknowledged", etc.) but the factory method silently discards it — it just returns `new(state, EmissionKind.None)`, identical to `None(state)`. Every call site that passes a carefully-worded reason string is doing dead work, and the comments in `Part9StateMachine` and the class-level remarks claim disabled/no-op transitions "produce ... a diagnostic log line", which they do not.

 **Recommendation:** Either propagate the reason (add it to `TransitionResult` and have the engine log it at debug level when emission is `None` for a no-op), or remove the unused `reason` parameter and collapse `NoOp` into `None`. Update the `Part9StateMachine` remarks that promise a diagnostic log line.

-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-23 — added a nullable `NoOpReason` property to `TransitionResult` (defaulted on the primary constructor so existing positional `new TransitionResult(state, kind)` call sites remain valid) and propagated it from `TransitionResult.NoOp(state, reason)`. `ScriptedAlarmEngine.ApplyAsync` now logs the reason at debug level via the alarm's script logger when the transition is a no-op, fulfilling the class-level remarks. Two regression tests assert that `NoOp` carries the reason and `None` does not.

 ### Core.ScriptedAlarms-012