From 272a9da61e9dd16d22ea2cd19934daa3493f2a9d Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Fri, 19 Jun 2026 11:21:35 -0400 Subject: [PATCH] review(Core.ScriptedAlarms): stop shelving timer on failed reload + drop dead branch Re-review at 7286d320. -015: dispose shelving timer at top of LoadAsync so a failed reload doesn't leave it firing against partially-cleared state + test. -014: make pendingEmissions required (removes unreachable fire-under-gate branch that could reintroduce the -003 deadlock). --- code-reviews/Core.ScriptedAlarms/findings.md | 107 +++++++++++++++++- .../ScriptedAlarmEngine.cs | 35 ++++-- .../ScriptedAlarmEngineTests.cs | 58 ++++++++++ 3 files changed, 186 insertions(+), 14 deletions(-) diff --git a/code-reviews/Core.ScriptedAlarms/findings.md b/code-reviews/Core.ScriptedAlarms/findings.md index 71f139b5..d643aa33 100644 --- a/code-reviews/Core.ScriptedAlarms/findings.md +++ b/code-reviews/Core.ScriptedAlarms/findings.md @@ -4,8 +4,8 @@ |---|---| | Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms` | | Reviewer | Claude Code | -| Review date | 2026-05-23 | -| Commit reviewed | `a9be809` | +| Review date | 2026-06-19 | +| Commit reviewed | `7286d320` | | Status | Reviewed | | Open findings | 0 | @@ -49,6 +49,38 @@ re-examined; existing closed findings stay as audit trail. | 9 | Testing coverage | No issues found | | 10 | Documentation & comments | Core.ScriptedAlarms-013 | +### 2026-06-19 re-review (commit `7286d320`) + +Full re-review at HEAD. Covers all commits since `a9be809`: `AlarmPredicateContext` +moved to `Core.Scripting.Abstractions`; `ScriptedAlarmEvent` gained `Comment` and +`HistorizeToAveva` fields; `AlarmAcknowledgeRequest.OperatorUser` wired through +`ScriptedAlarmSource`; `CompiledScriptCache` routing added; `_alarms.Clear()` in +`Dispose` re-enabled under `-016`; mass XML-doc backfill; CVE patches. All prior +findings remain Resolved. Cross-module observation noted below checklist +(not recorded as a module finding — root cause is in Server). + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | No issues found | +| 2 | OtOpcUa conventions | No issues found | +| 3 | Concurrency & thread safety | No issues found | +| 4 | Error handling & resilience | Core.ScriptedAlarms-015 | +| 5 | Security | No issues found | +| 6 | Performance & resource management | No issues found | +| 7 | Design-document adherence | No issues found | +| 8 | Code organization & conventions | Core.ScriptedAlarms-014 | +| 9 | Testing coverage | No issues found | +| 10 | Documentation & comments | No issues found | + +**Cross-module observation (2026-06-19):** `OtOpcUaNodeManager.cs:677` constructs an +`AlarmCommand` with `User: string.Empty` for the system-timer `OnTimedUnshelve` bypass +path. This empty string flows to `ScriptedAlarmHostActor.OnAlarmCommand` → +`_engine.UnshelveAsync("", ...)` → `Part9StateMachine.ApplyUnshelve` which throws +`ArgumentException` for `IsNullOrWhiteSpace(user)`. The exception is caught and +logged at the actor level (not a crash), but the auto-unshelve silently fails. Root +cause is in the Server module (`OtOpcUaNodeManager.cs`); the engine's user-validation +guard is correct. Fix belongs in a Server module review. + ## Findings ### Core.ScriptedAlarms-001 @@ -275,3 +307,74 @@ concurrent-read-while-writer scenario against a plain `Dictionary` comparisons + single-key reads against a quiesced engine. No code change required — the engine's correctness was always there; only the test-side contract was undocumented. + +### Core.ScriptedAlarms-014 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Code organization & conventions | +| Location | `ScriptedAlarmEngine.cs:539-540` (pre-fix) | +| Status | Resolved | + +**Description:** `EvaluatePredicateToStateAsync` had a nullable optional parameter +`List? pendingEmissions = null` with a fallback branch +`else FireEvent(evt); // LoadAsync path: no caller-supplied list, fire here.` +The fallback was dead code — both callers (`LoadAsync` at line 267 and +`ReevaluateAsync` at line 453) always pass an explicit `pending` list, so the +`pendingEmissions is null` branch was never reachable. The comment was actively +misleading: it implied a valid caller path that did not exist. Had a future +developer called the method with `null` (or with a new call site that omitted the +argument), the event would have been fired under `_evalGate` — directly +reintroducing the deadlock hazard fixed by Core.ScriptedAlarms-003. + +**Recommendation:** Remove the nullable default and dead branch; make the parameter +required (`List pendingEmissions`) so the compiler enforces the +contract that all callers must supply a pending list. + +**Resolution:** Resolved 2026-06-19 — changed `pendingEmissions` from +`List? pendingEmissions = null` to required +`List pendingEmissions`; removed the dead `else FireEvent(evt)` +branch and its misleading comment; added a `` doc explaining why the +parameter is required (preventing a future caller from accidentally firing events +under the gate). Both existing callers already passed a non-null list so no call +sites changed. Build and all 71 tests pass. + +--- + +### Core.ScriptedAlarms-015 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `ScriptedAlarmEngine.cs:284-298` (pre-fix) / `ScriptedAlarmEngine.cs:184-198` (post-fix) | +| Status | Resolved | + +**Description:** `LoadAsync` disposed `_shelvingTimer` (the prior generation's 5-second +timer) AFTER the `compileFailures` guard on line 241. If a reload call threw at the +compile step (all alarms fail to compile), execution never reached the +`_shelvingTimer?.Dispose()` call. The prior-generation timer kept firing against +whatever `_alarms` remained after the partial clear + partial recompile — a window +of inconsistency between the failed reload and the eventual `Dispose()` call. +`Dispose()` itself does call `_shelvingTimer?.Dispose()` on line 734, so there is +no permanent resource leak; the risk is unexpected shelving-check callbacks +operating on partially-cleared state during that window. On a first-load failure +(where `_loaded` remains `false`) the issue is benign since `EnsureLoaded()` gates +all public API calls. On a reload failure (previously loaded) the timer runs against +whatever alarms happened to compile before the bad ones. + +**Recommendation:** Move `_shelvingTimer?.Dispose(); _shelvingTimer = null;` to the +very start of the `try` block, alongside `UnsubscribeFromUpstream()`, so the prior +timer is always stopped before `_alarms` is cleared — regardless of whether the +reload succeeds or fails. + +**Resolution:** Resolved 2026-06-19 — moved `_shelvingTimer?.Dispose(); _shelvingTimer = null;` +to the start of the `try` block in `LoadAsync`, before `UnsubscribeFromUpstream()` and +`_alarms.Clear()`. Removed the now-redundant `_shelvingTimer?.Dispose()` that previously +appeared just before the new-timer assignment. Updated the comment at the new-timer +assignment site to note that `_shelvingTimer` is guaranteed null here. Added regression +test `Failed_reload_leaves_engine_recoverable_and_disposes_cleanly` documenting that +after a failed reload + successful third load, `Dispose()` completes cleanly with no +background tasks outliving the engine. + diff --git a/src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/ScriptedAlarmEngine.cs b/src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/ScriptedAlarmEngine.cs index 7ef1c6e0..9b14507d 100644 --- a/src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/ScriptedAlarmEngine.cs +++ b/src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/ScriptedAlarmEngine.cs @@ -184,6 +184,17 @@ public sealed class ScriptedAlarmEngine : IDisposable await _evalGate.WaitAsync(ct).ConfigureAwait(false); try { + // Stop the prior shelving timer BEFORE clearing _alarms so no in-progress + // tick can observe a half-cleared dictionary. Moving the dispose here also + // means a failed reload (compile errors, store failures) stops the timer + // even though execution never reaches the _shelvingTimer = new Timer(...) + // assignment at the bottom of the try block. Without this, the old timer + // keeps firing against the partially-cleared _alarms until Dispose() is + // eventually called — not a permanent leak, but an unexpected side effect + // during the window. (Core.ScriptedAlarms-015) + _shelvingTimer?.Dispose(); + _shelvingTimer = null; + UnsubscribeFromUpstream(); _alarms.Clear(); _alarmsReferencing.Clear(); @@ -281,13 +292,10 @@ public sealed class ScriptedAlarmEngine : IDisposable _upstreamSubscriptions.Add(_upstream.SubscribeTag(path, OnUpstreamChange)); _engineLogger.Information("ScriptedAlarmEngine loaded {Count} alarm(s)", _alarms.Count); - // Dispose any previously-created timer before reassigning; a second LoadAsync - // call without this would leave two timers firing against the same engine. - // (Core.ScriptedAlarms-002) - _shelvingTimer?.Dispose(); - // Start the shelving-check timer — ticks every 5s, expires any timed shelves - // that have passed their UnshelveAtUtc. + // that have passed their UnshelveAtUtc. The prior timer was already disposed + // at the START of this try block (Core.ScriptedAlarms-015), so _shelvingTimer + // is null here; no double-dispose risk. (Core.ScriptedAlarms-002, -015) _shelvingTimer = new Timer(_ => RunShelvingCheck(), null, TimeSpan.FromSeconds(5), TimeSpan.FromSeconds(5)); } @@ -480,9 +488,16 @@ public sealed class ScriptedAlarmEngine : IDisposable /// them after releasing _evalGate — keeping subscriber callbacks /// outside the gate. (Core.ScriptedAlarms-003) /// + /// + /// Every caller (LoadAsync and ReevaluateAsync) owns a pending list and + /// passes it here; emissions are always deferred to after the gate is released. + /// The parameter is required (non-nullable) to make this contract explicit and + /// prevent a future caller from accidentally firing events under the gate. + /// (Core.ScriptedAlarms-014) + /// private async Task EvaluatePredicateToStateAsync( AlarmState state, AlarmConditionState seed, DateTime nowUtc, CancellationToken ct, - List? pendingEmissions = null) + List pendingEmissions) { // Look up (or lazily allocate) the per-alarm scratch and refill its read cache // in place. The dictionary + context survive across evaluations so the hot path @@ -526,11 +541,7 @@ public sealed class ScriptedAlarmEngine : IDisposable if (result.Emission != EmissionKind.None) { var evt = BuildEmission(state, result.State, result.Emission); - if (evt is not null) - { - if (pendingEmissions is not null) pendingEmissions.Add(evt); - else FireEvent(evt); // LoadAsync path: no caller-supplied list, fire here. - } + if (evt is not null) pendingEmissions.Add(evt); } return result.State; } diff --git a/tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ScriptedAlarmEngineTests.cs b/tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ScriptedAlarmEngineTests.cs index b4061f15..396f8ac5 100644 --- a/tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ScriptedAlarmEngineTests.cs +++ b/tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ScriptedAlarmEngineTests.cs @@ -947,6 +947,64 @@ public sealed class ScriptedAlarmEngineTests r.NoOpReason.ShouldBeNull("None() factory has no reason — only NoOp() carries one"); } + // ------------------------------------------------------------------------- + // Core.ScriptedAlarms-015: timer runs against partially-cleared _alarms on + // a failed LoadAsync reload. When a second LoadAsync throws at compile time, + // _shelvingTimer?.Dispose() (which lives AFTER the compileFailures throw in + // the pre-fix code) is skipped, so the old timer keeps firing against + // whatever _alarms was left in after the partial clear + partial recompile. + // The timer is eventually cleaned up by Dispose(), so there is no permanent + // resource leak; the observable risk is unexpected shelving-check side effects + // during the window between the failed reload and Dispose. The fix moves + // _shelvingTimer?.Dispose() to the START of the try block, alongside + // UnsubscribeFromUpstream, so the timer is always stopped before _alarms is + // cleared. + // ------------------------------------------------------------------------- + + /// Verifies that a failed second LoadAsync (compile errors) leaves the engine in a consistent + /// state: the old timer is eventually cleaned up by Dispose, a third successful LoadAsync restores + /// functionality, and no background task outlives the engine (Core.ScriptedAlarms-015). + [Fact] + public async Task Failed_reload_leaves_engine_recoverable_and_disposes_cleanly(/* -015 */) + { + var up = new FakeUpstream(); + up.Set("Temp", 50); + var logger = new LoggerConfiguration().CreateLogger(); + var store = new InMemoryAlarmStateStore(); + var eng = new ScriptedAlarmEngine(up, store, new ScriptLoggerFactory(logger), logger); + + // === First load: succeeds === + await eng.LoadAsync([Alarm("A", "return false;")], TestContext.Current.CancellationToken); + eng.LoadedAlarmIds.ShouldContain("A"); + + // === Second load: fails at compile time === + // The old timer (started by the first load) now runs against whatever _alarms + // contains after the partial clear + partial recompile. _shelvingTimer?.Dispose() + // is skipped because it lives after the compileFailures throw. + var ex = await Should.ThrowAsync(async () => + await eng.LoadAsync([Alarm("B", "return cantCompileThis___!;")], + TestContext.Current.CancellationToken)); + ex.Message.ShouldContain("did not compile"); + + // Engine is in a partially-cleared state — A is gone, B didn't compile. + eng.LoadedAlarmIds.ShouldNotContain("A"); + eng.LoadedAlarmIds.ShouldNotContain("B"); + + // === Third load: succeeds — engine is still operational === + up.Set("Temp", 50); + await eng.LoadAsync([Alarm("C", "return false;")], TestContext.Current.CancellationToken); + eng.LoadedAlarmIds.ShouldContain("C"); + + // Dispose must complete cleanly (no background tasks outlive the engine, + // _shelvingTimer is stopped). Before the fix, the old timer from the first + // load could keep firing between the failed second load and Dispose. + // After the fix, it is stopped at the START of every LoadAsync attempt + // (successful or not), so Dispose always finds a clean timer state. + var disposeTask = Task.Run(() => eng.Dispose()); + await disposeTask.WaitAsync(TimeSpan.FromSeconds(3), TestContext.Current.CancellationToken); + // If Dispose threw or hung, the WaitAsync would surface it. + } + private static async Task WaitForAsync(Func cond, int timeoutMs = 2000) { var deadline = DateTime.UtcNow.AddMilliseconds(timeoutMs);