review(Core.ScriptedAlarms): stop shelving timer on failed reload + drop dead branch

Re-review at 7286d320. -015: dispose shelving timer at top of LoadAsync so a failed reload doesn't leave it firing against partially-cleared state + test. -014: make pendingEmissions required (removes unreachable fire-under-gate branch that could reintroduce the -003 deadlock).
2026-06-19 11:21:35 -04:00
parent 621d00e455
commit 272a9da61e
3 changed files with 186 additions and 14 deletions
@@ -4,8 +4,8 @@
 |---|---|
 | Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms` |
 | Reviewer | Claude Code |
-| Review date | 2026-05-23 |
-| Commit reviewed | `a9be809` |
+| Review date | 2026-06-19 |
+| Commit reviewed | `7286d320` |
 | Status | Reviewed |
 | Open findings | 0 |

@@ -49,6 +49,38 @@ re-examined; existing closed findings stay as audit trail.
 | 9 | Testing coverage | No issues found |
 | 10 | Documentation & comments | Core.ScriptedAlarms-013 |

+### 2026-06-19 re-review (commit `7286d320`)
+
+Full re-review at HEAD. Covers all commits since `a9be809`: `AlarmPredicateContext`
+moved to `Core.Scripting.Abstractions`; `ScriptedAlarmEvent` gained `Comment` and
+`HistorizeToAveva` fields; `AlarmAcknowledgeRequest.OperatorUser` wired through
+`ScriptedAlarmSource`; `CompiledScriptCache` routing added; `_alarms.Clear()` in
+`Dispose` re-enabled under `-016`; mass XML-doc backfill; CVE patches. All prior
+findings remain Resolved. Cross-module observation noted below checklist
+(not recorded as a module finding — root cause is in Server).
+
+| # | Category | Result |
+|---|---|---|
+| 1 | Correctness & logic bugs | No issues found |
+| 2 | OtOpcUa conventions | No issues found |
+| 3 | Concurrency & thread safety | No issues found |
+| 4 | Error handling & resilience | Core.ScriptedAlarms-015 |
+| 5 | Security | No issues found |
+| 6 | Performance & resource management | No issues found |
+| 7 | Design-document adherence | No issues found |
+| 8 | Code organization & conventions | Core.ScriptedAlarms-014 |
+| 9 | Testing coverage | No issues found |
+| 10 | Documentation & comments | No issues found |
+
+**Cross-module observation (2026-06-19):** `OtOpcUaNodeManager.cs:677` constructs an
+`AlarmCommand` with `User: string.Empty` for the system-timer `OnTimedUnshelve` bypass
+path. This empty string flows to `ScriptedAlarmHostActor.OnAlarmCommand` →
+`_engine.UnshelveAsync("", ...)` → `Part9StateMachine.ApplyUnshelve` which throws
+`ArgumentException` for `IsNullOrWhiteSpace(user)`. The exception is caught and
+logged at the actor level (not a crash), but the auto-unshelve silently fails. Root
+cause is in the Server module (`OtOpcUaNodeManager.cs`); the engine's user-validation
+guard is correct. Fix belongs in a Server module review.
+
 ## Findings

 ### Core.ScriptedAlarms-001
@@ -275,3 +307,74 @@ concurrent-read-while-writer scenario against a plain `Dictionary`
 comparisons + single-key reads against a quiesced engine. No code change
 required — the engine's correctness was always there; only the test-side
 contract was undocumented.
+
+### Core.ScriptedAlarms-014
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Location | `ScriptedAlarmEngine.cs:539-540` (pre-fix) |
+| Status | Resolved |
+
+**Description:** `EvaluatePredicateToStateAsync` had a nullable optional parameter
+`List<ScriptedAlarmEvent>? pendingEmissions = null` with a fallback branch
+`else FireEvent(evt); // LoadAsync path: no caller-supplied list, fire here.`
+The fallback was dead code — both callers (`LoadAsync` at line 267 and
+`ReevaluateAsync` at line 453) always pass an explicit `pending` list, so the
+`pendingEmissions is null` branch was never reachable. The comment was actively
+misleading: it implied a valid caller path that did not exist. Had a future
+developer called the method with `null` (or with a new call site that omitted the
+argument), the event would have been fired under `_evalGate` — directly
+reintroducing the deadlock hazard fixed by Core.ScriptedAlarms-003.
+
+**Recommendation:** Remove the nullable default and dead branch; make the parameter
+required (`List<ScriptedAlarmEvent> pendingEmissions`) so the compiler enforces the
+contract that all callers must supply a pending list.
+
+**Resolution:** Resolved 2026-06-19 — changed `pendingEmissions` from
+`List<ScriptedAlarmEvent>? pendingEmissions = null` to required
+`List<ScriptedAlarmEvent> pendingEmissions`; removed the dead `else FireEvent(evt)`
+branch and its misleading comment; added a `<remarks>` doc explaining why the
+parameter is required (preventing a future caller from accidentally firing events
+under the gate). Both existing callers already passed a non-null list so no call
+sites changed. Build and all 71 tests pass.
+
+---
+
+### Core.ScriptedAlarms-015
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Location | `ScriptedAlarmEngine.cs:284-298` (pre-fix) / `ScriptedAlarmEngine.cs:184-198` (post-fix) |
+| Status | Resolved |
+
+**Description:** `LoadAsync` disposed `_shelvingTimer` (the prior generation's 5-second
+timer) AFTER the `compileFailures` guard on line 241. If a reload call threw at the
+compile step (all alarms fail to compile), execution never reached the
+`_shelvingTimer?.Dispose()` call. The prior-generation timer kept firing against
+whatever `_alarms` remained after the partial clear + partial recompile — a window
+of inconsistency between the failed reload and the eventual `Dispose()` call.
+`Dispose()` itself does call `_shelvingTimer?.Dispose()` on line 734, so there is
+no permanent resource leak; the risk is unexpected shelving-check callbacks
+operating on partially-cleared state during that window. On a first-load failure
+(where `_loaded` remains `false`) the issue is benign since `EnsureLoaded()` gates
+all public API calls. On a reload failure (previously loaded) the timer runs against
+whatever alarms happened to compile before the bad ones.
+
+**Recommendation:** Move `_shelvingTimer?.Dispose(); _shelvingTimer = null;` to the
+very start of the `try` block, alongside `UnsubscribeFromUpstream()`, so the prior
+timer is always stopped before `_alarms` is cleared — regardless of whether the
+reload succeeds or fails.
+
+**Resolution:** Resolved 2026-06-19 — moved `_shelvingTimer?.Dispose(); _shelvingTimer = null;`
+to the start of the `try` block in `LoadAsync`, before `UnsubscribeFromUpstream()` and
+`_alarms.Clear()`. Removed the now-redundant `_shelvingTimer?.Dispose()` that previously
+appeared just before the new-timer assignment. Updated the comment at the new-timer
+assignment site to note that `_shelvingTimer` is guaranteed null here. Added regression
+test `Failed_reload_leaves_engine_recoverable_and_disposes_cleanly` documenting that
+after a failed reload + successful third load, `Dispose()` completes cleanly with no
+background tasks outliving the engine.
+
@@ -184,6 +184,17 @@ public sealed class ScriptedAlarmEngine : IDisposable
        await _evalGate.WaitAsync(ct).ConfigureAwait(false);
        try
        {
+            // Stop the prior shelving timer BEFORE clearing _alarms so no in-progress
+            // tick can observe a half-cleared dictionary. Moving the dispose here also
+            // means a failed reload (compile errors, store failures) stops the timer
+            // even though execution never reaches the _shelvingTimer = new Timer(...)
+            // assignment at the bottom of the try block. Without this, the old timer
+            // keeps firing against the partially-cleared _alarms until Dispose() is
+            // eventually called — not a permanent leak, but an unexpected side effect
+            // during the window. (Core.ScriptedAlarms-015)
+            _shelvingTimer?.Dispose();
+            _shelvingTimer = null;
+
            UnsubscribeFromUpstream();
            _alarms.Clear();
            _alarmsReferencing.Clear();
@@ -281,13 +292,10 @@ public sealed class ScriptedAlarmEngine : IDisposable
                _upstreamSubscriptions.Add(_upstream.SubscribeTag(path, OnUpstreamChange));
            _engineLogger.Information("ScriptedAlarmEngine loaded {Count} alarm(s)", _alarms.Count);

-            // Dispose any previously-created timer before reassigning; a second LoadAsync
-            // call without this would leave two timers firing against the same engine.
-            // (Core.ScriptedAlarms-002)
-            _shelvingTimer?.Dispose();
-
            // Start the shelving-check timer — ticks every 5s, expires any timed shelves
-            // that have passed their UnshelveAtUtc.
+            // that have passed their UnshelveAtUtc. The prior timer was already disposed
+            // at the START of this try block (Core.ScriptedAlarms-015), so _shelvingTimer
+            // is null here; no double-dispose risk. (Core.ScriptedAlarms-002, -015)
            _shelvingTimer = new Timer(_ => RunShelvingCheck(),
                null, TimeSpan.FromSeconds(5), TimeSpan.FromSeconds(5));
        }
@@ -480,9 +488,16 @@ public sealed class ScriptedAlarmEngine : IDisposable
    ///     them after releasing <c>_evalGate</c> — keeping subscriber callbacks
    ///     outside the gate. (Core.ScriptedAlarms-003)
    /// </summary>
+    /// <remarks>
+    ///     Every caller (LoadAsync and ReevaluateAsync) owns a <c>pending</c> list and
+    ///     passes it here; emissions are always deferred to after the gate is released.
+    ///     The parameter is required (non-nullable) to make this contract explicit and
+    ///     prevent a future caller from accidentally firing events under the gate.
+    ///     (Core.ScriptedAlarms-014)
+    /// </remarks>
    private async Task<AlarmConditionState> EvaluatePredicateToStateAsync(
        AlarmState state, AlarmConditionState seed, DateTime nowUtc, CancellationToken ct,
-        List<ScriptedAlarmEvent>? pendingEmissions = null)
+        List<ScriptedAlarmEvent> pendingEmissions)
    {
        // Look up (or lazily allocate) the per-alarm scratch and refill its read cache
        // in place. The dictionary + context survive across evaluations so the hot path
@@ -526,11 +541,7 @@ public sealed class ScriptedAlarmEngine : IDisposable
        if (result.Emission != EmissionKind.None)
        {
            var evt = BuildEmission(state, result.State, result.Emission);
-            if (evt is not null)
-            {
-                if (pendingEmissions is not null) pendingEmissions.Add(evt);
-                else FireEvent(evt); // LoadAsync path: no caller-supplied list, fire here.
-            }
+            if (evt is not null) pendingEmissions.Add(evt);
        }
        return result.State;
    }
@@ -947,6 +947,64 @@ public sealed class ScriptedAlarmEngineTests
        r.NoOpReason.ShouldBeNull("None() factory has no reason — only NoOp() carries one");
    }

+    // -------------------------------------------------------------------------
+    // Core.ScriptedAlarms-015: timer runs against partially-cleared _alarms on
+    // a failed LoadAsync reload. When a second LoadAsync throws at compile time,
+    // _shelvingTimer?.Dispose() (which lives AFTER the compileFailures throw in
+    // the pre-fix code) is skipped, so the old timer keeps firing against
+    // whatever _alarms was left in after the partial clear + partial recompile.
+    // The timer is eventually cleaned up by Dispose(), so there is no permanent
+    // resource leak; the observable risk is unexpected shelving-check side effects
+    // during the window between the failed reload and Dispose. The fix moves
+    // _shelvingTimer?.Dispose() to the START of the try block, alongside
+    // UnsubscribeFromUpstream, so the timer is always stopped before _alarms is
+    // cleared.
+    // -------------------------------------------------------------------------
+
+    /// <summary>Verifies that a failed second LoadAsync (compile errors) leaves the engine in a consistent
+    /// state: the old timer is eventually cleaned up by Dispose, a third successful LoadAsync restores
+    /// functionality, and no background task outlives the engine (Core.ScriptedAlarms-015).</summary>
+    [Fact]
+    public async Task Failed_reload_leaves_engine_recoverable_and_disposes_cleanly(/* -015 */)
+    {
+        var up = new FakeUpstream();
+        up.Set("Temp", 50);
+        var logger = new LoggerConfiguration().CreateLogger();
+        var store = new InMemoryAlarmStateStore();
+        var eng = new ScriptedAlarmEngine(up, store, new ScriptLoggerFactory(logger), logger);
+
+        // === First load: succeeds ===
+        await eng.LoadAsync([Alarm("A", "return false;")], TestContext.Current.CancellationToken);
+        eng.LoadedAlarmIds.ShouldContain("A");
+
+        // === Second load: fails at compile time ===
+        // The old timer (started by the first load) now runs against whatever _alarms
+        // contains after the partial clear + partial recompile. _shelvingTimer?.Dispose()
+        // is skipped because it lives after the compileFailures throw.
+        var ex = await Should.ThrowAsync<InvalidOperationException>(async () =>
+            await eng.LoadAsync([Alarm("B", "return cantCompileThis___!;")],
+                TestContext.Current.CancellationToken));
+        ex.Message.ShouldContain("did not compile");
+
+        // Engine is in a partially-cleared state — A is gone, B didn't compile.
+        eng.LoadedAlarmIds.ShouldNotContain("A");
+        eng.LoadedAlarmIds.ShouldNotContain("B");
+
+        // === Third load: succeeds — engine is still operational ===
+        up.Set("Temp", 50);
+        await eng.LoadAsync([Alarm("C", "return false;")], TestContext.Current.CancellationToken);
+        eng.LoadedAlarmIds.ShouldContain("C");
+
+        // Dispose must complete cleanly (no background tasks outlive the engine,
+        // _shelvingTimer is stopped). Before the fix, the old timer from the first
+        // load could keep firing between the failed second load and Dispose.
+        // After the fix, it is stopped at the START of every LoadAsync attempt
+        // (successful or not), so Dispose always finds a clean timer state.
+        var disposeTask = Task.Run(() => eng.Dispose());
+        await disposeTask.WaitAsync(TimeSpan.FromSeconds(3), TestContext.Current.CancellationToken);
+        // If Dispose threw or hung, the WaitAsync would surface it.
+    }
+
    private static async Task WaitForAsync(Func<bool> cond, int timeoutMs = 2000)
    {
        var deadline = DateTime.UtcNow.AddMilliseconds(timeoutMs);