fix(scripted-alarms): reuse per-alarm evaluation scratch on the hot path

Core.ScriptedAlarms-009 resolution: replace the per-call Dictionary + AlarmPredicateContext allocation with a per-alarm reusable AlarmScratch held in _scratchByAlarmId, refilled in place under _evalGate on each evaluation. The hot path no longer allocates per upstream tag change. Why this matters: On a busy line where many tags feeding many alarms change frequently, the old BuildReadCache allocated a fresh dictionary + context on every predicate evaluation — a steady stream of short-lived allocations the GC eventually has to reclaim. With the reuse, the dictionary and context are allocated once per alarm (on first evaluation) and refilled in place across every subsequent re-eval. Implementation: - New private AlarmScratch class holds the reusable Dictionary<string, DataValueSnapshot> read cache (pre-sized to the alarm's Inputs.Count) and the AlarmPredicateContext that wraps it by reference. The context observes refilled values without being re-created. - ConcurrentDictionary<string, AlarmScratch> _scratchByAlarmId on the engine, cleared in LoadAsync alongside _alarms so a config-publish drops the prior generation's scratch (Inputs / Logger may change). - EvaluatePredicateToStateAsync looks up scratch via GetOrAdd, calls the new RefillReadCache(Dictionary, IReadOnlySet) helper to clear + repopulate the dictionary in place, then runs the predicate against the reused context. - BuildReadCache removed. Safety: Reuse is serialised under _evalGate which guarantees no two threads ever observe the same scratch in a half-refilled state. The AlarmPredicateContext is bound to the scratch dictionary by reference, so the predicate's ctx.GetTag(path) sees the freshly-refilled values rather than a stale snapshot. Verification: - All 66 ScriptedAlarms tests pass (was 63 — three new regression tests locking the reuse contract). - All 56 VirtualTags tests still pass (unchanged). - All 104 Core.Scripting tests still pass (unchanged). New tests in ScriptedAlarmEngineTests: - Reevaluation_reuses_the_same_read_cache_dictionary — asserts ReferenceEquals(scratch_before, scratch_after) across two evaluations of the same alarm. - Reevaluation_reuses_the_same_predicate_context — same, for the context. - LoadAsync_drops_the_prior_generations_scratch — asserts a config publish wipes the prior scratch (so a stale Logger / Inputs can't leak into the new generation). Internal test hooks TryGetScratchReadCacheForTest / TryGetScratchContextForTest added via the existing InternalsVisibleTo for the tests project. Kept internal — not part of the public engine surface. Docs: - docs/v2/Galaxy.Performance.md "Scripted-alarm engine" section rewritten as "hot-path allocation reuse" documenting the new contract + reuse safety reasoning + the three regression tests. - code-reviews/Core.ScriptedAlarms/findings.md -009 flipped Won't Fix → Resolved. - code-reviews/README.md regenerated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 16:10:09 -04:00
parent 7b6ab2ec6f
commit 0001cdd579
5 changed files with 202 additions and 13 deletions
@@ -156,13 +156,29 @@ a category produced nothing rather than leaving it blank.
 | Severity | Low |
 | Category | Performance & resource management |
 | Location | `ScriptedAlarmEngine.cs:309-315`, `ScriptedAlarmEngine.cs:271` |
-| Status | Won't Fix |
+| Status | Resolved |

 **Description:** `BuildReadCache` allocates a fresh `Dictionary<string, DataValueSnapshot>` on every predicate evaluation, i.e. on every upstream tag change for every referencing alarm. On a busy line where many tags feeding many alarms change frequently, this is a steady stream of short-lived dictionary allocations on the hot path. `AlarmPredicateContext` is also newly constructed each evaluation (line 281).

 **Recommendation:** Minor. If the evaluation path shows up in allocation profiling, the read cache could be a reused per-alarm buffer cleared between evaluations (evaluations are already serialised under `_evalGate`, so a single shared scratch dictionary is safe). Not worth doing speculatively — flag for the perf surface in `docs/v2/Galaxy.Performance.md` if alarm evaluation is ever soak-tested.

-**Resolution:** Won't Fix 2026-05-23 — per the recommendation, no code change. Documented the known allocation characteristic in `docs/v2/Galaxy.Performance.md` (new "Scripted-alarm engine — known hot-path allocations" section) so a future soak that surfaces pressure has a noted mitigation (reused per-alarm scratch buffer) and we don't re-find this in a later review.
+**Resolution:** Resolved 2026-05-23 — added a per-alarm reusable `AlarmScratch`
+(read-cache `Dictionary` + `AlarmPredicateContext`) held in
+`_scratchByAlarmId`, populated lazily on first evaluation and refilled in place
+by the new `RefillReadCache(Dictionary, IReadOnlySet)` helper on every
+subsequent re-eval. `BuildReadCache` is gone. The reuse is safe because every
+evaluation runs under `_evalGate`; the context wraps the dictionary by
+reference, so the predicate's `ctx.GetTag(path)` sees the freshly-refilled
+values. `LoadAsync` clears `_scratchByAlarmId` alongside `_alarms` so a
+config-publish drops the prior generation's scratch (Inputs / Logger may
+change). Regression tests added in `ScriptedAlarmEngineTests`:
+`Reevaluation_reuses_the_same_read_cache_dictionary`,
+`Reevaluation_reuses_the_same_predicate_context`, and
+`LoadAsync_drops_the_prior_generations_scratch`; internal test hooks
+`TryGetScratchReadCacheForTest` / `TryGetScratchContextForTest` exposed via
+the existing `InternalsVisibleTo`. `docs/v2/Galaxy.Performance.md` "Scripted-alarm
+engine" section rewritten to document the new reuse contract. Suite now 66
+green (was 63).

 ### Core.ScriptedAlarms-010

@@ -281,7 +281,7 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
 | Core.ScriptedAlarms-003 | Low | Resolved | Documentation & comments | `ScriptedAlarmEngine.cs:343`, `docs/ScriptedAlarms.md:107` |
 | Core.ScriptedAlarms-006 | Low | Resolved | Concurrency & thread safety | `ScriptedAlarmEngine.cs:232`, `ScriptedAlarmEngine.cs:369` |
 | Core.ScriptedAlarms-008 | Low | Resolved | Performance & resource management | `Part9StateMachine.cs:261-268` |
-| Core.ScriptedAlarms-009 | Low | Won't Fix | Performance & resource management | `ScriptedAlarmEngine.cs:309-315`, `ScriptedAlarmEngine.cs:271` |
+| Core.ScriptedAlarms-009 | Low | Resolved | Performance & resource management | `ScriptedAlarmEngine.cs:309-315`, `ScriptedAlarmEngine.cs:271` |
 | Core.ScriptedAlarms-010 | Low | Resolved | Design-document adherence | `ScriptedAlarmEngine.cs:325-336`, `AlarmPredicateContext.cs:33-40`, `MessageTemplate.cs:47` |
 | Core.ScriptedAlarms-011 | Low | Resolved | Code organization & conventions | `Part9StateMachine.cs:275` |
 | Core.Scripting-005 | Low | Resolved | Correctness & logic bugs | `DependencyExtractor.cs:97` |
@@ -151,8 +151,11 @@ substantive driver change, and revise this table when the data does.
   `SubscriptionRegistry`, or a downstream consumer retaining
   `DataValueSnapshot` references past their useful life.

-## Scripted-alarm engine — known hot-path allocations
+## Scripted-alarm engine — hot-path allocation reuse

-`ScriptedAlarmEngine.BuildReadCache` allocates a fresh `Dictionary<string, DataValueSnapshot>` and `AlarmPredicateContext` on every predicate evaluation — i.e. once per upstream tag change per referencing alarm. On a busy line where many tags feeding many alarms change frequently, this is a steady stream of short-lived dictionary allocations on the hot path. (Core.ScriptedAlarms-009)
+`ScriptedAlarmEngine` keeps a per-alarm reusable evaluation scratch in `_scratchByAlarmId` — the read-cache `Dictionary<string, DataValueSnapshot>` and the `AlarmPredicateContext` are allocated once per alarm (on first evaluation) and refilled in place across every subsequent predicate evaluation. The hot path no longer allocates a fresh dictionary + context per upstream tag change. (Core.ScriptedAlarms-009)

-The allocations are deliberate for now: predicate evaluation is already serialised under `_evalGate`, so a single reused scratch dictionary would be safe, but the per-call dictionary keeps the evaluation surface immutable and trivially safe against future refactors. If a future scripted-alarm soak surfaces allocation pressure on this path, the mitigation is a per-alarm scratch buffer cleared between evaluations — note here before changing the engine.
+Safety: reuse is serialised under `_evalGate`, so two threads can never observe the same scratch in a half-refilled state. The context wraps the read-cache by reference, so refilling the dictionary is what the predicate's `ctx.GetTag(path)` calls observe. `LoadAsync` clears `_scratchByAlarmId` alongside `_alarms` so a config-publish drops the prior generation's scratch (a new generation may carry different `Inputs` / `Logger`). Regression tests in `ScriptedAlarmEngineTests` lock the reuse contract:
+- `Reevaluation_reuses_the_same_read_cache_dictionary` — asserts dictionary instance identity across two evaluations.
+- `Reevaluation_reuses_the_same_predicate_context` — same, for the context.
+- `LoadAsync_drops_the_prior_generations_scratch` — asserts a publish resets the scratch.
@@ -48,6 +48,37 @@ public sealed class ScriptedAlarmEngine : IDisposable
    // snapshot enumeration safe. The only write shapes are indexer-set and Clear,
    // both of which ConcurrentDictionary supports atomically. (Core.ScriptedAlarms-001)
    private readonly ConcurrentDictionary<string, AlarmState> _alarms = new(StringComparer.Ordinal);
+
+    /// <summary>
+    ///     Per-alarm reusable evaluation scratch. The read-cache dictionary and the
+    ///     <see cref="AlarmPredicateContext"/> instance are both allocated once per
+    ///     alarm (on first evaluation) and reused across every subsequent re-eval —
+    ///     the hot path no longer allocates a fresh dictionary + context per upstream
+    ///     tag change. Safe because <see cref="EvaluatePredicateToStateAsync"/> only
+    ///     runs under <see cref="_evalGate"/>, which serialises every evaluation:
+    ///     two threads can never observe the same scratch in a half-refilled state.
+    ///     Cleared in <see cref="LoadAsync"/> alongside <see cref="_alarms"/>.
+    ///     (Core.ScriptedAlarms-009)
+    /// </summary>
+    private readonly ConcurrentDictionary<string, AlarmScratch> _scratchByAlarmId =
+        new(StringComparer.Ordinal);
+
+    /// <summary>
+    ///     Test-only diagnostic: returns the per-alarm scratch read-cache dictionary
+    ///     if one has been allocated, else null. Used by Core.ScriptedAlarms-009
+    ///     regression tests to assert the scratch is reused across evaluations
+    ///     (two reads return the same instance).
+    /// </summary>
+    internal IReadOnlyDictionary<string, DataValueSnapshot>? TryGetScratchReadCacheForTest(string alarmId)
+        => _scratchByAlarmId.TryGetValue(alarmId, out var s) ? s.ReadCache : null;
+
+    /// <summary>
+    ///     Test-only diagnostic: returns the per-alarm <see cref="AlarmPredicateContext"/>
+    ///     if one has been allocated, else null. Companion to
+    ///     <see cref="TryGetScratchReadCacheForTest"/>.
+    /// </summary>
+    internal AlarmPredicateContext? TryGetScratchContextForTest(string alarmId)
+        => _scratchByAlarmId.TryGetValue(alarmId, out var s) ? s.Context : null;
    private readonly ConcurrentDictionary<string, DataValueSnapshot> _valueCache
        = new(StringComparer.Ordinal);
    private readonly Dictionary<string, HashSet<string>> _alarmsReferencing
@@ -108,6 +139,10 @@ public sealed class ScriptedAlarmEngine : IDisposable
            UnsubscribeFromUpstream();
            _alarms.Clear();
            _alarmsReferencing.Clear();
+            // Drop the prior generation's per-alarm scratch buffers — definitions may
+            // have changed (different Inputs, different Logger), so any reuse would be
+            // unsafe. (Core.ScriptedAlarms-009)
+            _scratchByAlarmId.Clear();

            var compileFailures = new List<string>();
            foreach (var def in definitions)
@@ -354,7 +389,13 @@ public sealed class ScriptedAlarmEngine : IDisposable
        AlarmState state, AlarmConditionState seed, DateTime nowUtc, CancellationToken ct,
        List<ScriptedAlarmEvent>? pendingEmissions = null)
    {
-        var inputs = BuildReadCache(state.Inputs);
+        // Look up (or lazily allocate) the per-alarm scratch and refill its read cache
+        // in place. The dictionary + context survive across evaluations so the hot path
+        // no longer allocates per upstream tag change. (Core.ScriptedAlarms-009)
+        var scratch = _scratchByAlarmId.GetOrAdd(
+            state.Definition.AlarmId,
+            _ => new AlarmScratch(state.Inputs, state.Logger, _clock));
+        RefillReadCache(scratch.ReadCache, state.Inputs);

        // Cold-start guard — skip the predicate when any referenced upstream tag has no
        // cached value yet (the upstream subscription hasn't delivered its first push).
@@ -362,9 +403,9 @@ public sealed class ScriptedAlarmEngine : IDisposable
        // every tick until the cache fills, spamming the log with identical stack traces.
        // Bad quality is treated the same: the input isn't available at the predicate's
        // expected type, so the only defensible move is to hold the prior condition state.
-        if (!AreInputsReady(inputs)) return seed;
+        if (!AreInputsReady(scratch.ReadCache)) return seed;

-        var context = new AlarmPredicateContext(inputs, state.Logger, _clock);
+        var context = scratch.Context;

        bool predicateTrue;
        try
@@ -399,12 +440,20 @@ public sealed class ScriptedAlarmEngine : IDisposable
        return result.State;
    }

-    private IReadOnlyDictionary<string, DataValueSnapshot> BuildReadCache(IReadOnlySet<string> inputs)
+    /// <summary>
+    ///     Refill <paramref name="cache"/> in place from <c>_valueCache</c>, falling
+    ///     back to a synchronous <c>ITagUpstreamSource.ReadTag</c> for paths whose
+    ///     first upstream push hasn't arrived yet. The dictionary is cleared and
+    ///     repopulated under <c>_evalGate</c> so no concurrent reader can observe
+    ///     a partial state. Replaces the old <c>BuildReadCache</c> which allocated a
+    ///     fresh dictionary every call (Core.ScriptedAlarms-009).
+    /// </summary>
+    private void RefillReadCache(
+        Dictionary<string, DataValueSnapshot> cache, IReadOnlySet<string> inputs)
    {
-        var d = new Dictionary<string, DataValueSnapshot>(StringComparer.Ordinal);
+        cache.Clear();
        foreach (var p in inputs)
-            d[p] = _valueCache.TryGetValue(p, out var v) ? v : _upstream.ReadTag(p);
-        return d;
+            cache[p] = _valueCache.TryGetValue(p, out var v) ? v : _upstream.ReadTag(p);
    }

    /// <summary>
@@ -611,6 +660,37 @@ public sealed class ScriptedAlarmEngine : IDisposable
        IReadOnlyList<string> TemplateTokens,
        ILogger Logger,
        AlarmConditionState Condition);
+
+    /// <summary>
+    ///     Per-alarm reusable evaluation scratch. The <see cref="ReadCache"/> dictionary
+    ///     is the same instance across every evaluation of the owning alarm — it is
+    ///     cleared and refilled in <see cref="ScriptedAlarmEngine.RefillReadCache"/> on
+    ///     each call. <see cref="Context"/> wraps that dictionary by reference, so a
+    ///     refilled <see cref="ReadCache"/> is what the predicate's
+    ///     <c>ctx.GetTag(path)</c> calls observe. (Core.ScriptedAlarms-009)
+    /// </summary>
+    /// <remarks>
+    ///     Reuse is safe because <see cref="ScriptedAlarmEngine"/> serialises every
+    ///     evaluation under <c>_evalGate</c>: two threads can never observe the same
+    ///     scratch in a half-refilled state.
+    /// </remarks>
+    private sealed class AlarmScratch
+    {
+        public Dictionary<string, DataValueSnapshot> ReadCache { get; }
+        public AlarmPredicateContext Context { get; }
+
+        public AlarmScratch(IReadOnlySet<string> inputs, ILogger logger, Func<DateTime> clock)
+        {
+            // Pre-size to the expected input count so the first refill doesn't pay the
+            // dictionary-grow cost. The dictionary auto-grows if Inputs changes (it
+            // cannot under the current contract — Inputs is fixed at LoadAsync — but
+            // pre-sizing is defensive against future changes).
+            ReadCache = new Dictionary<string, DataValueSnapshot>(inputs.Count, StringComparer.Ordinal);
+            // Context holds the read cache by reference. Refilling the dictionary
+            // updates what the context (and the script) observes.
+            Context = new AlarmPredicateContext(ReadCache, logger, clock);
+        }
+    }
 }

 /// <summary>
@@ -925,4 +925,94 @@ public sealed class ScriptedAlarmEngineTests
        public Task RemoveAsync(string alarmId, CancellationToken ct)
            => _inner.RemoveAsync(alarmId, ct);
    }
+
+    // --- Core.ScriptedAlarms-009: per-alarm evaluation-scratch reuse ---
+
+    [Fact]
+    public async Task Reevaluation_reuses_the_same_read_cache_dictionary()
+    {
+        // Pre-009 the engine allocated a fresh Dictionary<string, DataValueSnapshot>
+        // on every upstream-change tick; on a busy line this was a steady allocation
+        // stream on the hot path. The fix: one dictionary per alarm, refilled in place
+        // under _evalGate. Test asserts the dictionary instance is identical across
+        // two consecutive evaluations of the same alarm.
+        var up = new FakeUpstream();
+        up.Set("Temp", 50);
+        using var eng = Build(up, out _);
+        await eng.LoadAsync(
+            [Alarm("HighTemp", """return (int)ctx.GetTag("Temp").Value > 100;""")],
+            TestContext.Current.CancellationToken);
+
+        // First evaluation runs during LoadAsync. Capture the scratch reference now.
+        var scratchAfterLoad = eng.TryGetScratchReadCacheForTest("HighTemp");
+        scratchAfterLoad.ShouldNotBeNull(
+            "the scratch should have been allocated during LoadAsync's initial evaluation");
+
+        // Trigger a re-evaluation by pushing an upstream change.
+        up.Push("Temp", 150);
+        await WaitForAsync(() =>
+            eng.GetState("HighTemp")!.Active == AlarmActiveState.Active);
+
+        var scratchAfterPush = eng.TryGetScratchReadCacheForTest("HighTemp");
+        ReferenceEquals(scratchAfterLoad, scratchAfterPush).ShouldBeTrue(
+            "the read-cache dictionary must be the *same* instance across evaluations " +
+            "(Core.ScriptedAlarms-009) — a per-call allocation would defeat the fix.");
+        scratchAfterPush!["Temp"].Value.ShouldBe(150, "refill must update the existing dictionary in place");
+    }
+
+    [Fact]
+    public async Task Reevaluation_reuses_the_same_predicate_context()
+    {
+        // The context wraps the read-cache by reference; refilling the dictionary
+        // updates what the script sees. Reusing the context spares a per-call object
+        // allocation as well as the dictionary one.
+        var up = new FakeUpstream();
+        up.Set("Temp", 50);
+        using var eng = Build(up, out _);
+        await eng.LoadAsync(
+            [Alarm("HighTemp", """return (int)ctx.GetTag("Temp").Value > 100;""")],
+            TestContext.Current.CancellationToken);
+
+        var ctxAfterLoad = eng.TryGetScratchContextForTest("HighTemp");
+        ctxAfterLoad.ShouldNotBeNull();
+
+        up.Push("Temp", 150);
+        await WaitForAsync(() =>
+            eng.GetState("HighTemp")!.Active == AlarmActiveState.Active);
+
+        var ctxAfterPush = eng.TryGetScratchContextForTest("HighTemp");
+        ReferenceEquals(ctxAfterLoad, ctxAfterPush).ShouldBeTrue(
+            "the AlarmPredicateContext must be reused across evaluations (Core.ScriptedAlarms-009).");
+    }
+
+    [Fact]
+    public async Task LoadAsync_drops_the_prior_generations_scratch()
+    {
+        // A config-publish recreates AlarmStates with potentially different Inputs +
+        // Loggers; reusing the prior generation's scratch would attach an outdated
+        // logger to the new alarm. LoadAsync must clear _scratchByAlarmId so the
+        // next evaluation lazily re-allocates against the fresh AlarmState.
+        var up = new FakeUpstream();
+        up.Set("Temp", 50);
+        using var eng = Build(up, out _);
+        await eng.LoadAsync(
+            [Alarm("HighTemp", """return (int)ctx.GetTag("Temp").Value > 100;""")],
+            TestContext.Current.CancellationToken);
+
+        var scratchAfterFirstLoad = eng.TryGetScratchReadCacheForTest("HighTemp");
+        scratchAfterFirstLoad.ShouldNotBeNull();
+
+        // Second LoadAsync — same alarm id, same predicate, but the scratch should be
+        // wiped and re-allocated on the next evaluation. (LoadAsync itself triggers a
+        // first evaluation, so the scratch is reborn before we look.)
+        await eng.LoadAsync(
+            [Alarm("HighTemp", """return (int)ctx.GetTag("Temp").Value > 100;""")],
+            TestContext.Current.CancellationToken);
+
+        var scratchAfterSecondLoad = eng.TryGetScratchReadCacheForTest("HighTemp");
+        scratchAfterSecondLoad.ShouldNotBeNull();
+        ReferenceEquals(scratchAfterFirstLoad, scratchAfterSecondLoad).ShouldBeFalse(
+            "LoadAsync must drop the prior generation's scratch — reuse across a publish " +
+            "would attach a stale Logger / Inputs to the new alarm definition.");
+    }
 }