docs(design): VirtualTag/script memory scalability (A0+A+guardrail; C2 deferred) + measurement harness

Joseph Doherty
2026-06-07 14:55:28 -04:00
parent 89c07fc382
commit 321d57938f
6 changed files with 337 additions and 0 deletions
@@ -0,0 +1,180 @@
# VirtualTag / Script Memory Scalability — Design (2026-06-07)
## Problem
Deploying the Northwind company overlay (1036 `VirtualTag`s, each a one-line mirror script
`return ctx.GetTag("…").Value;`) OOM-killed the central nodes — even with a 4 GiB container limit,
on a 15.6 GiB Docker VM. The node materialised the address space and spawned 1036 `VirtualTagActor`s,
then died as their scripts compiled.
### Root cause (measured)
The per-script cost of Roslyn C# scripting is dominated by Roslyn's reference manager materialising
the **transitive reference closure of the assembly that contains the script's `globalsType`** — and
it is almost entirely **unmanaged** memory (a managed-heap snapshot barely moves). This is the
long-standing, unfixed `dotnet/roslyn#22219` ("Backlog"): the reporter measured ~50 MiB/script and
OOM at ~39 scripts; the confirmed mitigation in that thread is "move the globalsType to a lean
assembly."
Our globals type reaches Roslyn: `ScriptGlobals<VirtualTagContext>` → `Core.VirtualTags` →
`Core.Scripting` → `Microsoft.CodeAnalysis.CSharp.Scripting`. So every compile pays for the whole
Roslyn metadata closure.
### Measurement (probe in `tools/mem-probe/`, Roslyn 4.12.0, 50 retained scripts, 5 runs each)
| globals closure | per-script RSS | per-script managed |
|---|---|---|
| **Heavy**: `VirtualTagContext` (reaches Roslyn), *today* | **~18.2 MiB** (±25% noise) | 0.18 MiB |
| **Lean**: context in an Abstractions-only assembly (no Roslyn) | **~1.66 MiB** | 0.05 MiB |

**~11× reduction, ~99% unmanaged.** Real cost today ≈ **18 MiB/script** → 1036 vtags ≈ **18 GiB**
(which explains the instant OOM even at 4 GiB). The earlier "~3.5 MiB" guess was ~5× too low.
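
The headline figures follow from the table above. As a worked check, using the measured per-script values and binary units (1 GiB = 1024 MiB), and bearing in mind the ±25% noise on the heavy figure:

$$
\begin{aligned}
\text{reduction} &= 18.2 / 1.66 \approx 11.0\times \\
\text{unmanaged share (heavy)} &= (18.2 - 0.18) / 18.2 \approx 99\% \\
\text{today, 1036 scripts} &= 1036 \times 18.2\ \text{MiB} \approx 18\,855\ \text{MiB} \approx 18.4\ \text{GiB} \\
\text{lean, 1036 scripts compiled} &= 1036 \times 1.66\ \text{MiB} \approx 1\,720\ \text{MiB} \approx 1.68\ \text{GiB}
\end{aligned}
$$

The last line assumes every one of the 1036 scripts is still compiled at the lean per-script cost, i.e. it does not account for any script skipping compilation.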
### Corpus survey (decides the Phase-2 grammar)
Real VirtualTag and ScriptedAlarm scripts use a **small bounded statement grammar**: tag reads
(`ctx.GetTag("lit").Value` / `.StatusCode`), explicit casts, arithmetic `+ - * / %`, comparisons,
boolean logic, ternary/if-else, local variables, a fixed function set (`Math.*`, `System.Convert.*`,
`ScriptContext.Deadband`), and `ctx.SetVirtualTag` (vtag only). The Roslyn sandbox *deliberately*
also allows `System.Linq` (Sum/Average/Where + lambdas) and `System.Text.RegularExpressions` — the
**long tail** that a future interpreter would Roslyn-fallback rather than reimplement. VirtualTag
value scripts and ScriptedAlarm predicate scripts share the grammar (alarm = same expression → bool,
no `SetVirtualTag`).
## Scope decision
- **Phase 1 (build now):** A0 (globals isolation) + A (passthrough fast-path) + a **warn-only**
deploy guardrail.
- **Phase 2 (spec only, deferred):** C2 interpret-hybrid — documented here, built later only if/when
thousands of genuinely *complex* (non-passthrough) scripts justify it.
- Both consumers (VirtualTag + ScriptedAlarm) are in scope for A0; A is VirtualTag-passthrough only.
---
## Phase 1 design
### A0 — Isolate the globals type from Roslyn (the 11× win)
Extract the **script-callable** types into a new lean assembly
**`ZB.MOM.WW.OtOpcUa.Core.Scripting.Abstractions`** that references **only `Core.Abstractions`
(+ Serilog)** and **never Roslyn**:
- `ScriptContext` (base), `ScriptGlobals<T>` (the globals wrapper), `VirtualTagContext`,
`AlarmPredicateContext`, and `ScriptContext.Deadband`.
- Leave the Roslyn users (`ScriptSandbox`, `ScriptEvaluator`, `RoslynVirtualTagEvaluator`,
`RoslynScriptedAlarmEvaluator`, `ForbiddenTypeAnalyzer`, `DependencyExtractor`) in
`Core.Scripting`; they now reference the lean assembly **downward**, and the lean assembly never
references the Roslyn assembly.
- `Core.VirtualTags` / `Core.ScriptedAlarms` reference the lean assembly for the context types.
Net: the `globalsType`'s transitive closure becomes `{Core.Scripting.Abstractions, Core.Abstractions,
Serilog}`, with **no `Microsoft.CodeAnalysis.*`**. Pure dependency restructuring; **no behaviour change**.
**Structural note (the boundary):** `ScriptContext`/`VirtualTagContext`/`AlarmPredicateContext` must
have no Roslyn-referencing members. Confirm `ScriptContext` doesn't pull in `ScriptEvaluator`/
`ScriptSandbox` types (it shouldn't — those are the *callers*). If any helper on the context needs a
Roslyn type, it stays behind in `Core.Scripting`.
**Test:** the `tools/mem-probe` harness re-run shows per-script RSS in the ~1.66 MiB regime; the full
`Core.Scripting`, `Core.VirtualTags`, `Core.ScriptedAlarms`, and Host-integration script/alarm test
suites stay green (behaviour-preserving).
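
One way to stop the A0 boundary from regressing silently is a reflection-based check on the globals type's assembly. The sketch below is an illustration only: `ReferenceClosure` is a hypothetical helper, not existing code in this repo, and it walks CLR assembly references, which approximates but is not guaranteed to match what Roslyn's reference manager actually resolves.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical helper (not part of the codebase): computes the transitive
// set of assembly names reachable from a root assembly via its references.
public static class ReferenceClosure
{
    public static ISet<string> Of(Assembly root)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var queue = new Queue<Assembly>();
        queue.Enqueue(root);

        while (queue.Count > 0)
        {
            var asm = queue.Dequeue();
            var name = asm.GetName().Name;
            if (name is null || !seen.Add(name)) continue; // already visited

            foreach (AssemblyName reference in asm.GetReferencedAssemblies())
            {
                if (reference.Name is null || seen.Contains(reference.Name)) continue;
                try
                {
                    queue.Enqueue(Assembly.Load(reference));
                }
                catch (Exception)
                {
                    // Unloadable reference: record the name, but do not walk further.
                    seen.Add(reference.Name);
                }
            }
        }
        return seen;
    }

    // True if any assembly in the closure belongs to Roslyn.
    public static bool ReachesRoslyn(Assembly root) =>
        Of(root).Any(n => n.StartsWith("Microsoft.CodeAnalysis", StringComparison.Ordinal));
}
```

A unit test could then assert that `ReferenceClosure.ReachesRoslyn(...)` is false for the assembly of the globals wrapper and for the assembly of each of its generic arguments. The memory probe remains the authoritative measurement; this check only guards the dependency shape.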
### A — Passthrough fast-path (mirrors skip Roslyn entirely)
In the evaluator (`RoslynVirtualTagEvaluator.Evaluate`), **before** the `_cache.GetOrAdd(expression,
Compile)` lookup, detect the trivial mirror shape and short-circuit:
- Pattern (whitespace-tolerant): `^\s*return\s+ctx\.GetTag\(\s*"([^"]+)"\s*\)\.Value\s*;\s*$`.
- On match: return `dependencies[ref]` directly (the value is already passed into `Evaluate`) — **no
Roslyn, no cache entry, ~bytes.** Map a missing dep to the same `BadNodeIdUnknown`/no-change result
the Roslyn path would produce.
- Non-matches fall through to the existing Roslyn cache path unchanged.
- Downstream `DataType` coercion (in the actor) is unchanged — the fast-path returns the same raw
value the compiled script would have returned.
Covers 100% of the mirror overlay (the 1036). Keep it a narrow, exact pattern so a near-miss safely
falls through to Roslyn rather than mis-evaluating.
**Test:** passthrough returns the dep value with zero compilation (assert the cache stays empty);
a non-passthrough still compiles + works; a malformed near-match (`...Value + 1;`) falls through.
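
The detection step can be sketched as follows. This is a minimal illustration using the exact pattern quoted above; the class and method names (`PassthroughScript`, `TryGetTagRef`) are hypothetical, and the wiring into `RoslynVirtualTagEvaluator.Evaluate` (including the missing-dependency mapping) is deliberately left out.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: recognises only the exact one-line mirror shape
//   return ctx.GetTag("<ref>").Value;
// Anything else is reported as a non-match so the caller can fall
// through to the normal Roslyn compile path.
public static class PassthroughScript
{
    private static readonly Regex Mirror = new Regex(
        @"^\s*return\s+ctx\.GetTag\(\s*""([^""]+)""\s*\)\.Value\s*;\s*$",
        RegexOptions.CultureInvariant);

    public static bool TryGetTagRef(string source, out string tagRef)
    {
        Match m = Mirror.Match(source);
        tagRef = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }
}
```

A caller would check `PassthroughScript.TryGetTagRef(expression, out var tagRef)` first and look the value up in the dependencies it was already given; a near-miss such as `return ctx.GetTag("x").Value + 1;` does not match and therefore still compiles.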
### Warn-only guardrail
In `AdminOperationsActor.HandleStartDeploymentAsync`, **after** the existing `DraftValidator` gate and
**non-blocking**:
- `compiled = count(unique script sources that are NOT the passthrough shape)` (from the snapshot's
`Script` rows; reuse the A pattern to classify).
- `estMiB ≈ compiled × perScriptMiB` (configurable, default ~1.66 — the measured post-A0 cost).
- Emit a structured `_log.Warning("StartDeployment: {Compiled} scripts will compile (~{EstMiB} MiB
RSS/node); ensure node mem_limit covers it", …)` and append an advisory line to the **`Accepted`**
`StartDeploymentResult.Message`.
- **Never rejects.** Operator-visible, operator-decided.
**Test:** a config with many distinct non-passthrough scripts logs the warning + still returns
`Accepted`; a passthrough-only config logs ~0 compiled.
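
The estimate itself is small enough to sketch. The code below is an assumption-laden illustration, not the handler's implementation: `DeployMemoryEstimate` and its `For` method are hypothetical names, the script sources are taken as plain strings, and the passthrough pattern is repeated inline so the example stands alone.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: counts the unique script sources that would still be
// compiled (i.e. are not the passthrough shape) and scales by a per-script cost.
public static class DeployMemoryEstimate
{
    private static readonly Regex Mirror = new Regex(
        @"^\s*return\s+ctx\.GetTag\(\s*""([^""]+)""\s*\)\.Value\s*;\s*$",
        RegexOptions.CultureInvariant);

    public static (int Compiled, double EstMiB) For(
        IEnumerable<string> scriptSources,
        double perScriptMiB = 1.66) // default taken from the measured lean figure
    {
        int compiled = scriptSources
            .Distinct(StringComparer.Ordinal)   // identical sources compile once
            .Count(s => !Mirror.IsMatch(s));    // passthroughs never compile
        return (compiled, compiled * perScriptMiB);
    }
}
```

The result feeds the warning text and the advisory line only; nothing here influences whether the deployment is accepted.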
---
## Phase 2 design (deferred — spec only)
### C2 — Interpret-hybrid (built later, if thousands of *complex* scripts appear)
A bounded interpreter that replaces Roslyn for the surveyed grammar, with Roslyn retained as the
fallback for the long tail. Memory per interpreted script ≈ KB (a parsed AST), vs ~1.66 MiB (post-A0)
for a Roslyn-compiled one; scales to tens of thousands of complex scripts.
- **Grammar (statement language, no loops/methods/classes):** literals (numeric/string/bool), the
context API (`ctx.GetTag(lit)` + `.Value`/`.StatusCode`/timestamps, `ctx.SetVirtualTag(lit, expr)`,
`ctx.Now`, `ctx.Logger`), explicit casts `(int)`/`(double)`/`(bool)`, arithmetic `+ - * / %`,
comparisons `< > <= >= == !=`, boolean `&& || !`, ternary `?:`, `if/else`, `var` local bindings,
a fixed allow-listed function set (`Math.*`, `System.Convert.*`, `ScriptContext.Deadband`), and
`return`.
- **Evaluator:** start with a tree-walking interpreter (lowest memory); optionally compile the AST to
a `System.Linq.Expressions` delegate (C2-compiled) for hot tags — still collectible `DynamicMethod`s,
~KB, far below Roslyn.
- **Hybrid contract:** a classifier parses each script; if it's within the grammar → interpret; else
(LINQ, Regex, anything unrecognised) → Roslyn fallback (today's path). The deploy/warn guardrail
then counts only the *fallback* scripts.
- **Both consumers:** one engine serves VirtualTag value scripts (return value coerced to `DataType`)
and ScriptedAlarm predicates (return bool); the only difference is the return-type contract and that
alarm predicates reject `SetVirtualTag`.
- **Bonus:** interpreted scripts are a *hard* sandbox by construction — `ScriptSandbox` /
`ForbiddenTypeAnalyzer` (the curated metadata allow/deny machinery) only need guard the rare Roslyn
fallback path.
- **Risks (why deferred):** you own a small language (grammar/parser/semantics/error messages + tests);
C#-semantic-parity edge cases (int vs float division, overflow, null propagation); a classifier +
two engines to maintain. Worth it only at real complex-script scale.
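
To make the tree-walking idea concrete, here is a deliberately tiny sketch. It is not the proposed engine: the node types and `TreeWalker` are hypothetical, there is no parser or classifier, and every value is a `double`, which sidesteps exactly the C#-parity questions (int vs float division, overflow) listed under the risks.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical AST for a fragment of the surveyed grammar.
public abstract record Expr;
public sealed record Num(double Value) : Expr;
public sealed record TagValue(string TagRef) : Expr;              // ctx.GetTag("ref").Value
public sealed record Binary(char Op, Expr Left, Expr Right) : Expr;
public sealed record Ternary(Expr Cond, Expr Then, Expr Else) : Expr;

public static class TreeWalker
{
    // Recursively evaluates the tree against already-resolved tag values.
    public static double Eval(Expr e, IReadOnlyDictionary<string, double> tags) => e switch
    {
        Num n => n.Value,
        TagValue t => tags[t.TagRef],
        Binary b => Apply(b.Op, Eval(b.Left, tags), Eval(b.Right, tags)),
        // Non-zero is treated as true in this simplified, double-only model.
        Ternary c => Eval(c.Cond, tags) != 0.0 ? Eval(c.Then, tags) : Eval(c.Else, tags),
        _ => throw new NotSupportedException("outside the sketch grammar"),
    };

    private static double Apply(char op, double l, double r) => op switch
    {
        '+' => l + r,
        '-' => l - r,
        '*' => l * r,
        '/' => l / r,
        '%' => l % r,
        '>' => l > r ? 1.0 : 0.0,
        '<' => l < r ? 1.0 : 0.0,
        _ => throw new NotSupportedException("operator '" + op + "' not in the sketch"),
    };
}
```

The retained state per script is just the node graph, which is where the "≈ KB" estimate comes from; in a hybrid design, anything that cannot be expressed as such a tree would go to the Roslyn fallback instead.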
---
## Verification
1. **Memory probe** (`tools/mem-probe/`, retained as the measurement artifact): re-run after A0 shows
the ~11× per-script drop.
2. **Live docker-dev proof (the real acceptance gate):** re-deploy the full **1036-vtag overlay** with
the A0+A build and confirm the deploy is **Accepted** *and* the central node **stays under its
`mem_limit`** through materialisation + value streaming (no `OOMKilled`). This is what proves the
outage is actually gone.
3. Unit tests: passthrough fast-path + warn-guardrail.
4. Existing `Core.Scripting` / `Core.VirtualTags` / `Core.ScriptedAlarms` / Host script+alarm suites
stay green (A0 is behaviour-preserving).
## Sequencing & risk
| Step | Risk | Notes |
|---|---|---|
| A0 (assembly split) | medium — touches assembly layout across `Core.Scripting`/`VirtualTags`/`ScriptedAlarms` + Host refs | behaviour-preserving; the measured 11× payoff; do first |
| A (passthrough) | low — narrow exact pattern in one evaluator method | additive; covers the mirror overlay |
| guardrail | low — non-blocking log + message | additive |
| C2 | n/a | **deferred**; spec only |

A0 first (it moves types); A + guardrail are additive on top. The Phase-2 spec is documentation only.
## Related context
- `dotnet/roslyn#22219` — the upstream issue (globalsType-closure memory, mostly unmanaged, no fix).
- Measurement harness: `tools/mem-probe/` (this branch).
- Recovery already shipped this session: `docker-dev` `mem_limit 1g→2g` (`master` `89c07fc`) +
cleared the OOM-causing sealed deployments. The full-validator deploy gate
(`AdminOperationsActor` + `DraftValidator`) is where the warn-guardrail hooks in.
@@ -0,0 +1,2 @@
*/bin/
*/obj/
@@ -0,0 +1,18 @@
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<TargetFramework>net10.0</TargetFramework>
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>
<LangVersion>latest</LangVersion>
<!-- Throwaway memory probe: keep build noise low. -->
<TreatWarningsAsErrors>false</TreatWarningsAsErrors>
<GenerateDocumentationFile>false</GenerateDocumentationFile>
</PropertyGroup>
<ItemGroup>
<!-- Closure of THIS assembly = {LeanContext, Core.Abstractions}. No Roslyn. -->
<ProjectReference Include="..\..\..\src\Core\ZB.MOM.WW.OtOpcUa.Core.Abstractions\ZB.MOM.WW.OtOpcUa.Core.Abstractions.csproj" />
</ItemGroup>
</Project>
@@ -0,0 +1,36 @@
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace LeanContext;
/// <summary>
/// LEAN globals type for the memory probe. Its transitive reference closure is only
/// {LeanContext, Core.Abstractions} — deliberately NO Roslyn — so the per-script cost
/// of the Roslyn reference-manager loading the globalsType's closure (dotnet/roslyn#22219)
/// can be measured against the heavy <c>VirtualTagContext</c>, whose closure pulls in
/// Microsoft.CodeAnalysis.CSharp.Scripting.
/// <para>
/// <see cref="GetTag"/> returns the same <see cref="DataValueSnapshot"/> type that
/// <c>VirtualTagContext.GetTag</c> returns, so the probe's script source
/// (<c>ctx.GetTag("x").Value</c>) is byte-identical for both modes — the ONLY
/// difference is which assembly closure the globalsType lives in.
/// </para>
/// </summary>
public sealed class LeanCtx
{
private readonly System.Collections.Generic.Dictionary<string, DataValueSnapshot> _d = new();
public DataValueSnapshot GetTag(string p) =>
_d.TryGetValue(p, out var v) ? v : new DataValueSnapshot(null, 0u, null, default);
}
/// <summary>
/// LEAN analogue of the prod <c>ScriptGlobals&lt;TContext&gt;</c> wrapper: exposes a
/// named <c>ctx</c> property so the script source can be byte-identical to the heavy
/// path (<c>ctx.GetTag(...).Value</c>). Lives in the LeanContext assembly, so its
/// reference closure is {LeanContext, Core.Abstractions} — NO Roslyn. This is the A0
/// "globals type in a lean assembly" treatment.
/// </summary>
public sealed class LeanGlobals
{
public LeanCtx ctx { get; set; } = new();
}
@@ -0,0 +1,23 @@
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFramework>net10.0</TargetFramework>
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>
<LangVersion>latest</LangVersion>
<TreatWarningsAsErrors>false</TreatWarningsAsErrors>
<GenerateDocumentationFile>false</GenerateDocumentationFile>
</PropertyGroup>
<ItemGroup>
<!-- Same Roslyn version the repo pins (Directory.Packages.props => 4.12.0, CPM). -->
<PackageReference Include="Microsoft.CodeAnalysis.CSharp.Scripting" />
</ItemGroup>
<ItemGroup>
<ProjectReference Include="..\LeanContext\LeanContext.csproj" />
<ProjectReference Include="..\..\..\src\Core\ZB.MOM.WW.OtOpcUa.Core.VirtualTags\ZB.MOM.WW.OtOpcUa.Core.VirtualTags.csproj" />
</ItemGroup>
</Project>
@@ -0,0 +1,78 @@
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;
// Memory measurement probe for Roslyn C# scripting per dotnet/roslyn#22219.
// Compiles + RETAINS N distinct scripts (like the prod compiled-delegate cache) and
// measures the per-script working-set cost. The ONLY thing that varies between "heavy"
// and "lean" is the globalsType's assembly closure:
// heavy = VirtualTagContext (closure pulls in Roslyn via Core.Scripting)
// lean = LeanCtx (closure = {LeanContext, Core.Abstractions} only)
static long Rss() => System.Diagnostics.Process.GetCurrentProcess().WorkingSet64;
static void Settle()
{
for (int i = 0; i < 3; i++)
{
GC.Collect(2, GCCollectionMode.Forced, true, true);
GC.WaitForPendingFinalizers();
}
}
var mode = args.Length > 0 ? args[0] : "heavy";
int N = args.Length > 1 && int.TryParse(args[1], out var n) ? n : 50;
// The globalsType is what Roslyn's reference manager loads the transitive closure of
// (dotnet/roslyn#22219). We mirror production's choice: prod uses the WRAPPER
// ScriptGlobals<TContext> (which exposes the named `ctx` property), NOT the raw context.
// heavy = ScriptGlobals<VirtualTagContext> -> wrapper lives in Core.Scripting AND the
// generic arg VirtualTagContext lives in Core.VirtualTags; both -> Roslyn.
// lean = LeanGlobals -> closure {LeanContext, Core.Abstractions},
// NO Roslyn. This is the A0 "globals type in a lean assembly" treatment.
var globalsType = mode == "lean"
? typeof(LeanContext.LeanGlobals)
: typeof(ZB.MOM.WW.OtOpcUa.Core.Scripting.ScriptGlobals<ZB.MOM.WW.OtOpcUa.Core.VirtualTags.VirtualTagContext>);
// The script reads ctx.GetTag("x").Value. We must reference: the globalsType's own
// assembly, the assemblies of its generic type arguments (so `ctx`'s property type
// resolves), and Core.Abstractions (DataValueSnapshot, the return type of GetTag).
// References are minimal + identical in spirit for both modes; the real difference is
// the transitive UNMANAGED closure of the globalsType's assembly that Roslyn's reference
// manager loads per compilation (the #22219 effect).
var snapshotAssembly = typeof(ZB.MOM.WW.OtOpcUa.Core.Abstractions.DataValueSnapshot).Assembly;
var refAssemblies = new System.Collections.Generic.HashSet<System.Reflection.Assembly>
{
globalsType.Assembly,
snapshotAssembly,
};
foreach (var ga in globalsType.GetGenericArguments())
refAssemblies.Add(ga.Assembly);
var opts = ScriptOptions.Default
.WithReferences(refAssemblies)
.WithImports();
// Warm up Roslyn once (compile 1 throwaway) so the baseline excludes one-time Roslyn init.
_ = CSharpScript.Create<object>("return 0;", opts, globalsType).GetCompilation().GetDiagnostics();
Settle();
long baseRss = Rss();
long baseGc = GC.GetTotalMemory(true);
var held = new System.Collections.Generic.List<object>(N);
for (int i = 0; i < N; i++)
{
var src = $"return ctx.GetTag(\"ref_{i}\").Value;";
var script = CSharpScript.Create<object>(src, opts, globalsType);
script.Compile(); // force compilation / emit
held.Add(script.CreateDelegate()); // retain the compiled delegate (like the prod cache)
}
Settle();
long afterRss = Rss();
long afterGc = GC.GetTotalMemory(true);
GC.KeepAlive(held);
Console.WriteLine($"MODE={mode} N={N}");
Console.WriteLine($" baseline RSS={baseRss / 1048576.0:F1} MiB managed={baseGc / 1048576.0:F1} MiB");
Console.WriteLine($" afterN RSS={afterRss / 1048576.0:F1} MiB managed={afterGc / 1048576.0:F1} MiB");
Console.WriteLine($" PER-SCRIPT: RSS={(afterRss - baseRss) / 1048576.0 / N:F2} MiB/script managed={(afterGc - baseGc) / 1048576.0 / N:F2} MiB/script");