From 321d57938f0ad4328af34a4c0378bf3f8083e883 Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Sun, 7 Jun 2026 14:55:28 -0400
Subject: [PATCH] docs(design): VirtualTag/script memory scalability (A0+A+guardrail; C2 deferred) + measurement harness

---
 ...6-06-07-virtualtag-script-memory-design.md | 180 ++++++++++++++++++
 tools/mem-probe/.gitignore                    |   2 +
 .../mem-probe/LeanContext/LeanContext.csproj  |  18 ++
 tools/mem-probe/LeanContext/LeanCtx.cs        |  36 ++++
 tools/mem-probe/MemProbe/MemProbe.csproj      |  23 +++
 tools/mem-probe/MemProbe/Program.cs           |  78 ++++++++
 6 files changed, 337 insertions(+)
 create mode 100644 docs/plans/2026-06-07-virtualtag-script-memory-design.md
 create mode 100644 tools/mem-probe/.gitignore
 create mode 100644 tools/mem-probe/LeanContext/LeanContext.csproj
 create mode 100644 tools/mem-probe/LeanContext/LeanCtx.cs
 create mode 100644 tools/mem-probe/MemProbe/MemProbe.csproj
 create mode 100644 tools/mem-probe/MemProbe/Program.cs

diff --git a/docs/plans/2026-06-07-virtualtag-script-memory-design.md b/docs/plans/2026-06-07-virtualtag-script-memory-design.md
new file mode 100644
index 00000000..408ba977
--- /dev/null
+++ b/docs/plans/2026-06-07-virtualtag-script-memory-design.md
@@ -0,0 +1,180 @@
+# VirtualTag / Script Memory Scalability — Design (2026-06-07)
+
+## Problem
+
+Deploying the Northwind company overlay (1036 `VirtualTag`s, each a one-line mirror script
+`return ctx.GetTag("…").Value;`) OOM-killed the central nodes — even with a 4 GiB container limit,
+on a 15.6 GiB Docker VM. The node materialised the address space and spawned 1036 `VirtualTagActor`s,
+then died as their scripts compiled.
+
+### Root cause (measured)
+
+The per-script cost of Roslyn C# scripting is dominated by Roslyn's reference manager materialising
+the **transitive reference closure of the assembly that contains the script's `globalsType`** — and
+it is almost entirely **unmanaged** memory (a managed-heap snapshot barely moves).
+This is the long-standing, unfixed `dotnet/roslyn#22219` ("Backlog"): the reporter measured ~50 MiB/script and
+OOM at ~39 scripts; the confirmed mitigation in that thread is "move the globalsType to a lean
+assembly."
+
+Our globals type reaches Roslyn: `ScriptGlobals` → `Core.VirtualTags` →
+`Core.Scripting` → `Microsoft.CodeAnalysis.CSharp.Scripting`. So every compile pays for the whole
+Roslyn metadata closure.
+
+### Measurement (probe in `tools/mem-probe/`, Roslyn 4.12.0, 50 retained scripts, 5 runs each)
+
+| globals closure | per-script RSS | per-script managed |
+|---|---|---|
+| **Heavy** — `VirtualTagContext` (reaches Roslyn) — *today* | **~18.2 MiB** (±25% noise) | 0.18 MiB |
+| **Lean** — context in an Abstractions-only assembly (no Roslyn) | **~1.66 MiB** | 0.05 MiB |
+
+→ **~11× reduction, ~99% unmanaged.** Real cost today ≈ **18 MiB/script** → 1036 vtags ≈ **~18 GiB**
+(explains the instant OOM even at 4 GiB). The earlier "~3.5 MiB" guess was ~5× too low.
+
+### Corpus survey (decides the Phase-2 grammar)
+
+Real VirtualTag and ScriptedAlarm scripts use a **small bounded statement grammar**: tag reads
+(`ctx.GetTag("lit").Value` / `.StatusCode`), explicit casts, arithmetic `+ - * / %`, comparisons,
+boolean logic, ternary/if-else, local variables, a fixed function set (`Math.*`, `System.Convert.*`,
+`ScriptContext.Deadband`), and `ctx.SetVirtualTag` (vtag only). The Roslyn sandbox *deliberately*
+also allows `System.Linq` (Sum/Average/Where + lambdas) and `System.Text.RegularExpressions` — the
+**long tail** that a future interpreter would Roslyn-fallback rather than reimplement. VirtualTag
+value scripts and ScriptedAlarm predicate scripts share the grammar (alarm = same expression → bool,
+no `SetVirtualTag`).
+
+## Scope decision
+
+- **Phase 1 (build now):** A0 (globals isolation) + A (passthrough fast-path) + a **warn-only**
+  deploy guardrail.
+- **Phase 2 (spec only, deferred):** C2 interpret-hybrid — documented here, built later only if/when
+  thousands of genuinely *complex* (non-passthrough) scripts justify it.
+- Both consumers (VirtualTag + ScriptedAlarm) are in scope for A0; A is VirtualTag-passthrough only.
+
+---
+
+## Phase 1 design
+
+### A0 — Isolate the globals type from Roslyn (the 11× win)
+
+Extract the **script-callable** types into a new lean assembly
+**`ZB.MOM.WW.OtOpcUa.Core.Scripting.Abstractions`** that references **only `Core.Abstractions`
+(+ Serilog)** and **never Roslyn**:
+
+- `ScriptContext` (base), `ScriptGlobals` (the globals wrapper), `VirtualTagContext`,
+  `AlarmPredicateContext`, and `ScriptContext.Deadband`.
+- Leave the Roslyn users — `ScriptSandbox`, `ScriptEvaluator`, `RoslynVirtualTagEvaluator`,
+  `RoslynScriptedAlarmEvaluator`, `ForbiddenTypeAnalyzer`, `DependencyExtractor` — in
+  `Core.Scripting`; they now reference the lean assembly **downward** (the lean assembly never
+  references the Roslyn assembly).
+- `Core.VirtualTags` / `Core.ScriptedAlarms` reference the lean assembly for the context types.
+
+Net: the `globalsType`'s transitive closure becomes `{Core.Scripting.Abstractions, Core.Abstractions,
+Serilog}` — **no `Microsoft.CodeAnalysis.*`**. Pure dependency restructuring; **no behavior change**.
+
+**Structural note (the boundary):** `ScriptContext`/`VirtualTagContext`/`AlarmPredicateContext` must
+have no Roslyn-referencing members. Confirm `ScriptContext` doesn't pull in `ScriptEvaluator`/
+`ScriptSandbox` types (it shouldn't — those are the *callers*). If any helper on the context needs a
+Roslyn type, it stays behind in `Core.Scripting`.
+
+**Test:** the `tools/mem-probe` harness re-run shows per-script RSS in the ~1.66 MiB regime; the full
+`Core.Scripting`, `Core.VirtualTags`, `Core.ScriptedAlarms`, and Host-integration script/alarm test
+suites stay green (behaviour-preserving).
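The A0 boundary ("no `Microsoft.CodeAnalysis.*` in the globals type's closure") could also be guarded by a cheap structural check. The following is a minimal sketch, not part of this patch: `ReferenceClosure` is a hypothetical helper, and it walks the runtime reference graph via reflection, which is only an approximation of what Roslyn's reference manager resolves. `typeof(Uri).Assembly` is a stand-in root so the sketch runs anywhere; a real check would presumably pass the assembly that declares the globals type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical helper: collect the simple names of every assembly that
// `root` transitively references (breadth-first over GetReferencedAssemblies).
HashSet<string> ReferenceClosure(Assembly root)
{
    var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    var pending = new Queue<Assembly>();
    pending.Enqueue(root);

    while (pending.Count > 0)
    {
        var assembly = pending.Dequeue();
        var name = assembly.GetName().Name;
        if (name is null || !seen.Add(name)) continue; // already visited

        foreach (var reference in assembly.GetReferencedAssemblies())
        {
            if (reference.Name is null || seen.Contains(reference.Name)) continue;
            try
            {
                pending.Enqueue(Assembly.Load(reference));
            }
            catch (Exception)
            {
                // Unresolvable reference: still record the name so the check sees it.
                seen.Add(reference.Name);
            }
        }
    }
    return seen;
}

// Stand-in root (assumption): replace with the globals type's assembly in a real test.
var closure = ReferenceClosure(typeof(Uri).Assembly);
var offenders = closure
    .Where(n => n.StartsWith("Microsoft.CodeAnalysis", StringComparison.Ordinal))
    .ToList();

Console.WriteLine(offenders.Count == 0
    ? "closure is Roslyn-free"
    : "closure reaches Roslyn via: " + string.Join(", ", offenders));
```

A check of this shape only complements the `tools/mem-probe` measurement; the probe remains the evidence for the actual memory effect.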
+
+### A — Passthrough fast-path (mirrors skip Roslyn entirely)
+
+In the evaluator (`RoslynVirtualTagEvaluator.Evaluate`), **before** the `_cache.GetOrAdd(expression,
+Compile)` lookup, detect the trivial mirror shape and short-circuit:
+
+- Pattern (whitespace-tolerant): `^\s*return\s+ctx\.GetTag\(\s*"([^"]+)"\s*\)\.Value\s*;\s*$`.
+- On match: return `dependencies[ref]` directly (the value is already passed into `Evaluate`) — **no
+  Roslyn, no cache entry, ~bytes.** Map a missing dep to the same `BadNodeIdUnknown`/no-change result
+  the Roslyn path would produce.
+- Non-matches fall through to the existing Roslyn cache path unchanged.
+- Downstream `DataType` coercion (in the actor) is unchanged — the fast-path returns the same raw
+  value the compiled script would have returned.
+
+Covers 100% of the mirror overlay (the 1036). Keep it a narrow, exact pattern so a near-miss safely
+falls through to Roslyn rather than mis-evaluating.
+
+**Test:** passthrough returns the dep value with zero compilation (assert the cache stays empty);
+a non-passthrough still compiles + works; a malformed near-match (`...Value + 1;`) falls through.
+
+### Warn-only guardrail
+
+In `AdminOperationsActor.HandleStartDeploymentAsync`, **after** the existing `DraftValidator` gate and
+**non-blocking**:
+
+- `compiled = count(unique script sources that are NOT the passthrough shape)` (from the snapshot's
+  `Script` rows; reuse the A pattern to classify).
+- `estMiB ≈ compiled × perScriptMiB` (configurable, default ~1.66 — the measured post-A0 cost).
+- Emit a structured `_log.Warning("StartDeployment: {Compiled} scripts will compile (~{EstMiB} MiB
+  RSS/node); ensure node mem_limit covers it", …)` and append an advisory line to the **`Accepted`**
+  `StartDeploymentResult.Message`.
+- **Never rejects.** Operator-visible, operator-decided.
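The classification and estimate described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the shipped implementation: `CountCompiled` and `EstimateMiB` are hypothetical names, the sample tag paths are invented, and only the regex and the 1.66 MiB default come from this document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Passthrough shape from section A: exactly one `return ctx.GetTag("...").Value;`.
var passthrough = new Regex(
    @"^\s*return\s+ctx\.GetTag\(\s*""([^""]+)""\s*\)\.Value\s*;\s*$",
    RegexOptions.CultureInvariant);

// Hypothetical helper: unique script sources that would still go through Roslyn.
int CountCompiled(IEnumerable<string> sources) =>
    sources.Distinct(StringComparer.Ordinal).Count(s => !passthrough.IsMatch(s));

// Hypothetical helper: advisory estimate; 1.66 MiB is the measured post-A0 per-script cost.
double EstimateMiB(int compiled, double perScriptMiB = 1.66) => compiled * perScriptMiB;

// Invented sample sources for illustration only.
var scripts = new[]
{
    "return ctx.GetTag(\"Line1/Speed\").Value;",             // passthrough
    "  return  ctx.GetTag( \"Line1/Temp\" ).Value ;  ",      // passthrough (extra whitespace)
    "return (double)ctx.GetTag(\"Line1/Speed\").Value + 1;", // near-miss: falls through to Roslyn
    "return (double)ctx.GetTag(\"Line1/Speed\").Value + 1;", // duplicate source: counted once
};

int compiled = CountCompiled(scripts);
Console.WriteLine($"{compiled} scripts will compile (~{EstimateMiB(compiled):F2} MiB RSS/node)");
```

With the four sample sources, only the near-miss counts, so `compiled` is 1. Deduplicating by exact source text mirrors the idea that the compiled-delegate cache is keyed by the expression; whether the real cache key is exactly the raw source is an assumption here.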
+
+**Test:** a config with many distinct non-passthrough scripts logs the warning + still returns
+`Accepted`; a passthrough-only config logs ~0 compiled.
+
+---
+
+## Phase 2 design (deferred — spec only)
+
+### C2 — Interpret-hybrid (built later, if thousands of *complex* scripts appear)
+
+A bounded interpreter that replaces Roslyn for the surveyed grammar, with Roslyn retained as the
+fallback for the long tail. Memory per interpreted script ≈ KB (a parsed AST), vs ~1.66 MiB (post-A0)
+for a Roslyn-compiled one; scales to tens of thousands of complex scripts.
+
+- **Grammar (statement language, no loops/methods/classes):** literals (numeric/string/bool), the
+  context API (`ctx.GetTag(lit)` + `.Value`/`.StatusCode`/timestamps, `ctx.SetVirtualTag(lit, expr)`,
+  `ctx.Now`, `ctx.Logger`), explicit casts `(int)`/`(double)`/`(bool)`, arithmetic `+ - * / %`,
+  comparisons `< > <= >= == !=`, boolean `&& || !`, ternary `?:`, `if/else`, `var` local bindings,
+  a fixed allow-listed function set (`Math.*`, `System.Convert.*`, `ScriptContext.Deadband`), and
+  `return`.
+- **Evaluator:** start with a tree-walking interpreter (lowest memory); optionally compile the AST to
+  a `System.Linq.Expressions` delegate (C2-compiled) for hot tags — still collectible `DynamicMethod`s,
+  ~KB, far below Roslyn.
+- **Hybrid contract:** a classifier parses each script; if it's within the grammar → interpret; else
+  (LINQ, Regex, anything unrecognised) → Roslyn fallback (today's path). The deploy/warn guardrail
+  then counts only the *fallback* scripts.
+- **Both consumers:** one engine serves VirtualTag value scripts (return value coerced to `DataType`)
+  and ScriptedAlarm predicates (return bool); the only difference is the return-type contract and that
+  alarm predicates reject `SetVirtualTag`.
+- **Bonus:** interpreted scripts are a *hard* sandbox by construction — `ScriptSandbox` /
+  `ForbiddenTypeAnalyzer` (the curated metadata allow/deny machinery) only need guard the rare Roslyn
+  fallback path.
+- **Risks (why deferred):** you own a small language (grammar/parser/semantics/error messages + tests);
+  C#-semantic-parity edge cases (int vs float division, overflow, null propagation); a classifier +
+  two engines to maintain. Worth it only at real complex-script scale.
+
+---
+
+## Verification
+
+1. **Memory probe** (`tools/mem-probe/`, retained as the measurement artifact): re-run after A0 shows
+   the ~11× per-script drop.
+2. **Live docker-dev proof (the real acceptance gate):** re-deploy the full **1036-vtag overlay** with
+   the A0+A build and confirm the deploy is **Accepted** *and* the central node **stays under its
+   `mem_limit`** through materialisation + value streaming (no `OOMKilled`). This is what proves the
+   outage is actually gone.
+3. Unit tests: passthrough fast-path + warn-guardrail.
+4. Existing `Core.Scripting` / `Core.VirtualTags` / `Core.ScriptedAlarms` / Host script+alarm suites
+   stay green (A0 is behaviour-preserving).
+
+## Sequencing & risk
+
+| Step | Risk | Notes |
+|---|---|---|
+| A0 (assembly split) | medium — touches assembly layout across `Core.Scripting`/`VirtualTags`/`ScriptedAlarms` + Host refs | behaviour-preserving; the measured 11× payoff; do first |
+| A (passthrough) | low — narrow exact pattern in one evaluator method | additive; covers the mirror overlay |
+| guardrail | low — non-blocking log + message | additive |
+| C2 | — | **deferred**; spec only |
+
+A0 first (it moves types); A + guardrail are additive on top. The Phase-2 spec is documentation only.
+
+## Related context
+
+- `dotnet/roslyn#22219` — the upstream issue (globalsType-closure memory, mostly unmanaged, no fix).
+- Measurement harness: `tools/mem-probe/` (this branch).
+- Recovery already shipped this session: `docker-dev` `mem_limit 1g→2g` (`master` `89c07fc`) +
+  cleared the OOM-causing sealed deployments. The full-validator deploy gate
+  (`AdminOperationsActor` + `DraftValidator`) is where the warn-guardrail hooks in.
diff --git a/tools/mem-probe/.gitignore b/tools/mem-probe/.gitignore
new file mode 100644
index 00000000..659cdfc1
--- /dev/null
+++ b/tools/mem-probe/.gitignore
@@ -0,0 +1,2 @@
+*/bin/
+*/obj/
diff --git a/tools/mem-probe/LeanContext/LeanContext.csproj b/tools/mem-probe/LeanContext/LeanContext.csproj
new file mode 100644
index 00000000..e189e7ad
--- /dev/null
+++ b/tools/mem-probe/LeanContext/LeanContext.csproj
@@ -0,0 +1,18 @@
+
+
+    net10.0
+    enable
+    enable
+    latest
+
+    false
+    false
+
+
+
+
+
+
+
+
diff --git a/tools/mem-probe/LeanContext/LeanCtx.cs b/tools/mem-probe/LeanContext/LeanCtx.cs
new file mode 100644
index 00000000..b30c85ae
--- /dev/null
+++ b/tools/mem-probe/LeanContext/LeanCtx.cs
@@ -0,0 +1,36 @@
+using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
+
+namespace LeanContext;
+
+/// <summary>
+/// LEAN globals type for the memory probe. Its transitive reference closure is only
+/// {LeanContext, Core.Abstractions} — deliberately NO Roslyn — so the per-script cost
+/// of the Roslyn reference-manager loading the globalsType's closure (dotnet/roslyn#22219)
+/// can be measured against the heavy VirtualTagContext, whose closure pulls in
+/// Microsoft.CodeAnalysis.CSharp.Scripting.
+/// <para>
+/// <see cref="GetTag"/> returns the same type that
+/// VirtualTagContext.GetTag returns, so the probe's script source
+/// (ctx.GetTag("x").Value) is byte-identical for both modes — the ONLY
+/// difference is which assembly closure the globalsType lives in.
+/// </para>
+/// </summary>
+public sealed class LeanCtx
+{
+    private readonly System.Collections.Generic.Dictionary<string, DataValueSnapshot> _d = new();
+
+    public DataValueSnapshot GetTag(string p) =>
+        _d.TryGetValue(p, out var v) ?
+        v : new DataValueSnapshot(null, 0u, null, default);
+}
+
+/// <summary>
+/// LEAN analogue of the prod ScriptGlobals&lt;TContext&gt; wrapper: exposes a
+/// named ctx property so the script source can be byte-identical to the heavy
+/// path (ctx.GetTag(...).Value). Lives in the LeanContext assembly, so its
+/// reference closure is {LeanContext, Core.Abstractions} — NO Roslyn. This is the A0
+/// "globals type in a lean assembly" treatment.
+/// </summary>
+public sealed class LeanGlobals
+{
+    public LeanCtx ctx { get; set; } = new();
+}
diff --git a/tools/mem-probe/MemProbe/MemProbe.csproj b/tools/mem-probe/MemProbe/MemProbe.csproj
new file mode 100644
index 00000000..7fbe73fd
--- /dev/null
+++ b/tools/mem-probe/MemProbe/MemProbe.csproj
@@ -0,0 +1,23 @@
+
+
+    Exe
+    net10.0
+    enable
+    enable
+    latest
+    false
+    false
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/tools/mem-probe/MemProbe/Program.cs b/tools/mem-probe/MemProbe/Program.cs
new file mode 100644
index 00000000..9cbc085d
--- /dev/null
+++ b/tools/mem-probe/MemProbe/Program.cs
@@ -0,0 +1,78 @@
+using Microsoft.CodeAnalysis.CSharp.Scripting;
+using Microsoft.CodeAnalysis.Scripting;
+
+// Memory measurement probe for Roslyn C# scripting per dotnet/roslyn#22219.
+// Compiles + RETAINS N distinct scripts (like the prod compiled-delegate cache) and
+// measures the per-script working-set cost. The ONLY thing that varies between "heavy"
+// and "lean" is the globalsType's assembly closure:
+//   heavy = VirtualTagContext (closure pulls in Roslyn via Core.Scripting)
+//   lean  = LeanCtx (closure = {LeanContext, Core.Abstractions} only)
+
+static long Rss() => System.Diagnostics.Process.GetCurrentProcess().WorkingSet64;
+
+static void Settle()
+{
+    for (int i = 0; i < 3; i++)
+    {
+        GC.Collect(2, GCCollectionMode.Forced, true, true);
+        GC.WaitForPendingFinalizers();
+    }
+}
+
+var mode = args.Length > 0 ? args[0] : "heavy";
+int N = args.Length > 1 && int.TryParse(args[1], out var n) ?
+    n : 50;
+
+// The globalsType is what Roslyn's reference manager loads the transitive closure of
+// (dotnet/roslyn#22219). We mirror production's choice: prod uses the WRAPPER
+// ScriptGlobals<VirtualTagContext> (which exposes the named `ctx` property), NOT the raw context.
+//   heavy = ScriptGlobals<VirtualTagContext> -> wrapper lives in Core.Scripting AND the
+//           generic arg VirtualTagContext lives in Core.VirtualTags; both -> Roslyn.
+//   lean  = LeanGlobals -> closure {LeanContext, Core.Abstractions},
+//           NO Roslyn. This is the A0 "globals type in a lean assembly" treatment.
+var globalsType = mode == "lean"
+    ? typeof(LeanContext.LeanGlobals)
+    : typeof(ZB.MOM.WW.OtOpcUa.Core.Scripting.ScriptGlobals<ZB.MOM.WW.OtOpcUa.Core.VirtualTags.VirtualTagContext>);
+
+// The script reads ctx.GetTag("x").Value. We must reference: the globalsType's own
+// assembly, the assemblies of its generic type arguments (so `ctx`'s property type
+// resolves), and Core.Abstractions (DataValueSnapshot, the return type of GetTag).
+// References are minimal + identical in spirit for both modes; the real difference is
+// the transitive UNMANAGED closure of the globalsType's assembly that Roslyn's reference
+// manager loads per compilation (the #22219 effect).
+var snapshotAssembly = typeof(ZB.MOM.WW.OtOpcUa.Core.Abstractions.DataValueSnapshot).Assembly;
+var refAssemblies = new System.Collections.Generic.HashSet<System.Reflection.Assembly>
+{
+    globalsType.Assembly,
+    snapshotAssembly,
+};
+foreach (var ga in globalsType.GetGenericArguments())
+    refAssemblies.Add(ga.Assembly);
+var opts = ScriptOptions.Default
+    .WithReferences(refAssemblies)
+    .WithImports();
+
+// Warm up Roslyn once (compile 1 throwaway) so the baseline excludes one-time Roslyn init.
+_ = CSharpScript.Create("return 0;", opts, globalsType).GetCompilation().GetDiagnostics();
+Settle();
+
+long baseRss = Rss();
+long baseGc = GC.GetTotalMemory(true);
+
+var held = new System.Collections.Generic.List<ScriptRunner<object>>(N);
+for (int i = 0; i < N; i++)
+{
+    var src = $"return ctx.GetTag(\"ref_{i}\").Value;";
+    var script = CSharpScript.Create(src, opts, globalsType);
+    script.Compile(); // force compilation / emit
+    held.Add(script.CreateDelegate()); // retain the compiled delegate (like the prod cache)
+}
+
+Settle();
+long afterRss = Rss();
+long afterGc = GC.GetTotalMemory(true);
+GC.KeepAlive(held);
+
+Console.WriteLine($"MODE={mode} N={N}");
+Console.WriteLine($" baseline RSS={baseRss / 1048576.0:F1} MiB managed={baseGc / 1048576.0:F1} MiB");
+Console.WriteLine($" afterN RSS={afterRss / 1048576.0:F1} MiB managed={afterGc / 1048576.0:F1} MiB");
+Console.WriteLine($" PER-SCRIPT: RSS={(afterRss - baseRss) / 1048576.0 / N:F2} MiB/script managed={(afterGc - baseGc) / 1048576.0 / N:F2} MiB/script");