docs(design): VirtualTag/script memory scalability (A0+A+guardrail; C2 deferred) + measurement harness
@@ -0,0 +1,180 @@
# VirtualTag / Script Memory Scalability: Design (2026-06-07)

## Problem

Deploying the Northwind company overlay (1036 `VirtualTag`s, each a one-line mirror script
`return ctx.GetTag("…").Value;`) OOM-killed the central nodes, even with a 4 GiB container limit
on a 15.6 GiB Docker VM. The node materialised the address space and spawned 1036 `VirtualTagActor`s,
then died as their scripts compiled.

### Root cause (measured)

The per-script cost of Roslyn C# scripting is dominated by Roslyn's reference manager materialising
the **transitive reference closure of the assembly that contains the script's `globalsType`**, and
that cost is almost entirely **unmanaged** memory (a managed-heap snapshot barely moves). This is the
long-standing, unfixed `dotnet/roslyn#22219` ("Backlog"): the reporter measured ~50 MiB/script and
OOM at ~39 scripts; the confirmed mitigation in that thread is "move the globalsType to a lean
assembly."

Our globals type reaches Roslyn: `ScriptGlobals<VirtualTagContext>` → `Core.VirtualTags` →
`Core.Scripting` → `Microsoft.CodeAnalysis.CSharp.Scripting`. So every compile pays for the whole
Roslyn metadata closure.

### Measurement (probe in `tools/mem-probe/`, Roslyn 4.12.0, 50 retained scripts, 5 runs each)

| Globals closure | Per-script RSS | Per-script managed heap |
|---|---|---|
| **Heavy** (*today*): `VirtualTagContext`, closure reaches Roslyn | **~18.2 MiB** (±25% noise) | 0.18 MiB |
| **Lean**: context in an Abstractions-only assembly (no Roslyn) | **~1.66 MiB** | 0.05 MiB |

Result: **~11× reduction**, and **~99%** of the heavy per-script cost is unmanaged. The real cost today
is ≈ **18 MiB/script**, so 1036 vtags need ≈ **18 GiB**, which explains the instant OOM even at 4 GiB.
The earlier "~3.5 MiB" guess was ~5× too low.
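
The headline figures follow from the table by simple arithmetic. A minimal sketch, using only the measured values above (the class and method names are illustrative and not part of the probe):

```csharp
using System;

public static class MemProbeMath
{
    // Measured per-script values from the table above, in MiB.
    public const double HeavyRssMiB = 18.2;
    public const double LeanRssMiB = 1.66;
    public const double HeavyManagedMiB = 0.18;

    // Heavy-to-lean ratio of per-script RSS.
    public static double ReductionFactor() => HeavyRssMiB / LeanRssMiB;

    // Share of the heavy per-script RSS that is not managed heap.
    public static double UnmanagedShare() => (HeavyRssMiB - HeavyManagedMiB) / HeavyRssMiB;

    // Projected RSS in GiB for a number of retained compiled scripts.
    public static double ProjectedGiB(int scripts, double perScriptMiB) =>
        scripts * perScriptMiB / 1024.0;
}
```

With these inputs the ratio is about 10.96, the unmanaged share about 0.99, and 1036 scripts project to about 18.4 GiB at the heavy cost versus about 1.68 GiB at the lean cost. The ±25% noise on the heavy figure applies to all of these.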

### Corpus survey (decides the Phase-2 grammar)

Real VirtualTag and ScriptedAlarm scripts use a **small bounded statement grammar**: tag reads
(`ctx.GetTag("lit").Value` / `.StatusCode`), explicit casts, arithmetic `+ - * / %`, comparisons,
boolean logic, ternary/if-else, local variables, a fixed function set (`Math.*`, `System.Convert.*`,
`ScriptContext.Deadband`), and `ctx.SetVirtualTag` (vtag only). The Roslyn sandbox *deliberately*
also allows `System.Linq` (Sum/Average/Where + lambdas) and `System.Text.RegularExpressions`: the
**long tail** that a future interpreter would hand to the Roslyn fallback rather than reimplement.
VirtualTag value scripts and ScriptedAlarm predicate scripts share the grammar (an alarm predicate is
the same expression language returning `bool`, with no `SetVirtualTag`).

## Scope decision

- **Phase 1 (build now):** A0 (globals isolation) + A (passthrough fast-path) + a **warn-only**
  deploy guardrail.
- **Phase 2 (spec only, deferred):** C2 interpret-hybrid, documented here and built later only if/when
  thousands of genuinely *complex* (non-passthrough) scripts justify it.
- Both consumers (VirtualTag + ScriptedAlarm) are in scope for A0; A is VirtualTag-passthrough only.

---

## Phase 1 design

### A0: Isolate the globals type from Roslyn (the 11× win)

Extract the **script-callable** types into a new lean assembly
**`ZB.MOM.WW.OtOpcUa.Core.Scripting.Abstractions`** that references **only `Core.Abstractions`
(+ Serilog)** and **never Roslyn**:

- Move `ScriptContext` (base), `ScriptGlobals<T>` (the globals wrapper), `VirtualTagContext`,
  `AlarmPredicateContext`, and `ScriptContext.Deadband` into the lean assembly.
- Leave the Roslyn users (`ScriptSandbox`, `ScriptEvaluator`, `RoslynVirtualTagEvaluator`,
  `RoslynScriptedAlarmEvaluator`, `ForbiddenTypeAnalyzer`, `DependencyExtractor`) in
  `Core.Scripting`; they now reference the lean assembly **downward**, and the lean assembly never
  references `Core.Scripting` or Roslyn.
- `Core.VirtualTags` / `Core.ScriptedAlarms` reference the lean assembly for the context types.

Net: the `globalsType`'s transitive closure becomes `{Core.Scripting.Abstractions, Core.Abstractions,
Serilog}`, with **no `Microsoft.CodeAnalysis.*`**. Pure dependency restructuring; **no behaviour change**.

**Structural note (the boundary):** `ScriptContext`/`VirtualTagContext`/`AlarmPredicateContext` must
have no Roslyn-referencing members. Confirm `ScriptContext` doesn't pull in `ScriptEvaluator`/
`ScriptSandbox` types (it shouldn't: those are the *callers*). If any helper on the context needs a
Roslyn type, it stays behind in `Core.Scripting`.

**Test:** a re-run of the `tools/mem-probe` harness shows per-script RSS in the ~1.66 MiB regime; the full
`Core.Scripting`, `Core.VirtualTags`, `Core.ScriptedAlarms`, and Host-integration script/alarm test
suites stay green (behaviour-preserving).
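
The boundary could also be pinned by a unit test that walks the static reference closure of the globals assembly. The sketch below is one way to do that. It is not part of this commit, `ClosureCheck` and `FindForbidden` are hypothetical names, and it assumes the static assembly-reference graph is a good proxy for what Roslyn's reference manager loads.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

public static class ClosureCheck
{
    // Walks the transitive assembly-reference closure of `root` and returns the first
    // referenced assembly name that starts with `forbiddenPrefix`, or null if none does.
    // Side effect: referenced assemblies are loaded into the current process.
    public static string? FindForbidden(Assembly root, string forbiddenPrefix)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var pending = new Stack<Assembly>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            var current = pending.Pop();
            if (!seen.Add(current.GetName().Name ?? "")) continue; // already walked

            foreach (AssemblyName reference in current.GetReferencedAssemblies())
            {
                var name = reference.Name ?? "";
                if (name.StartsWith(forbiddenPrefix, StringComparison.OrdinalIgnoreCase))
                    return name;
                if (seen.Contains(name)) continue;

                try { pending.Push(Assembly.Load(reference)); }
                catch (Exception) { /* unresolvable reference: it cannot be walked, skip it */ }
            }
        }
        return null;
    }
}
```

After A0, calling `ClosureCheck.FindForbidden(typeof(ScriptGlobals<VirtualTagContext>).Assembly, "Microsoft.CodeAnalysis")` would be expected to return `null`. For a generic globals type, the assemblies of its type arguments would need the same check.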

### A: Passthrough fast-path (mirrors skip Roslyn entirely)

In the evaluator (`RoslynVirtualTagEvaluator.Evaluate`), **before** the `_cache.GetOrAdd(expression,
Compile)` lookup, detect the trivial mirror shape and short-circuit:

- Pattern (whitespace-tolerant): `^\s*return\s+ctx\.GetTag\(\s*"([^"]+)"\s*\)\.Value\s*;\s*$`.
- On match: return `dependencies[ref]` directly (the value is already passed into `Evaluate`), with **no
  Roslyn, no cache entry, only bytes of memory.** Map a missing dep to the same `BadNodeIdUnknown`/no-change
  result the Roslyn path would produce.
- Non-matches fall through to the existing Roslyn cache path unchanged.
- Downstream `DataType` coercion (in the actor) is unchanged: the fast-path returns the same raw
  value the compiled script would have returned.

Covers 100% of the mirror overlay (the 1036). Keep it a narrow, exact pattern so a near-miss safely
falls through to Roslyn rather than mis-evaluating.
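
A minimal sketch of the classifier, using the pattern from the list above. `Passthrough` and `TryGetMirroredTag` are hypothetical names, and the backslash guard is an added assumption rather than part of the design: a literal containing C# escapes would differ from the raw captured text, so it is sent to Roslyn instead.

```csharp
using System.Text.RegularExpressions;

public static class Passthrough
{
    // The exact mirror shape: return ctx.GetTag("<ref>").Value;
    private static readonly Regex Mirror = new Regex(
        @"^\s*return\s+ctx\.GetTag\(\s*""([^""]+)""\s*\)\.Value\s*;\s*$",
        RegexOptions.CultureInvariant | RegexOptions.Compiled);

    // Returns true and the referenced tag when `source` is a pure mirror script.
    // Anything else returns false so the caller falls through to the Roslyn path.
    public static bool TryGetMirroredTag(string source, out string tagRef)
    {
        tagRef = "";
        var match = Mirror.Match(source);
        if (!match.Success) return false;

        var candidate = match.Groups[1].Value;
        // Assumption (not in the design): a backslash may start a C# escape sequence,
        // so the raw text would not equal the compiled literal. Treat it as a near-miss.
        if (candidate.Contains("\\")) return false;

        tagRef = candidate;
        return true;
    }
}
```

In the evaluator, a `true` result would lead to returning the already-supplied dependency value for `tagRef`, and a `false` result would continue to the existing cache lookup.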

**Test:** passthrough returns the dep value with zero compilation (assert the cache stays empty);
a non-passthrough script still compiles and works; a malformed near-match (`...Value + 1;`) falls through.

### Warn-only guardrail

In `AdminOperationsActor.HandleStartDeploymentAsync`, **after** the existing `DraftValidator` gate and
**non-blocking**:

- `compiled = count(unique script sources that are NOT the passthrough shape)` (from the snapshot's
  `Script` rows; reuse the A pattern to classify).
- `estMiB ≈ compiled × perScriptMiB` (configurable, default ~1.66, the measured post-A0 cost).
- Emit a structured `_log.Warning("StartDeployment: {Compiled} scripts will compile (~{EstMiB} MiB
  RSS/node); ensure node mem_limit covers it", …)` and append an advisory line to the **`Accepted`**
  `StartDeploymentResult.Message`.
- **Never rejects.** Operator-visible, operator-decided.
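
The estimate itself is a count and a multiplication. A minimal sketch, assuming the script sources are available as plain strings (`DeployGuardrail` and `Estimate` are hypothetical names; the real hook is `AdminOperationsActor.HandleStartDeploymentAsync`, whose signature is not shown here):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class DeployGuardrail
{
    // Same mirror pattern as the A fast-path, used here only to classify sources.
    private static readonly Regex Mirror = new Regex(
        @"^\s*return\s+ctx\.GetTag\(\s*""([^""]+)""\s*\)\.Value\s*;\s*$",
        RegexOptions.CultureInvariant);

    // Counts distinct non-passthrough sources and multiplies by the per-script cost.
    // Distinct-by-source mirrors a compiled cache that is keyed on the script text.
    public static (int Compiled, double EstMiB) Estimate(
        IEnumerable<string> scriptSources, double perScriptMiB = 1.66)
    {
        int compiled = scriptSources
            .Distinct(StringComparer.Ordinal)
            .Count(source => !Mirror.IsMatch(source));
        return (compiled, compiled * perScriptMiB);
    }
}
```

The two returned values would feed the `{Compiled}` and `{EstMiB}` placeholders of the warning. Nothing in this sketch rejects a deployment.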

**Test:** a config with many distinct non-passthrough scripts logs the warning and still returns
`Accepted`; a passthrough-only config reports 0 compiled scripts.

---

## Phase 2 design (deferred, spec only)

### C2: Interpret-hybrid (built later, if thousands of *complex* scripts appear)

A bounded interpreter that replaces Roslyn for the surveyed grammar, with Roslyn retained as the
fallback for the long tail. Memory per interpreted script is on the order of KB (a parsed AST), vs
~1.66 MiB (post-A0) for a Roslyn-compiled one; this scales to tens of thousands of complex scripts.

- **Grammar (statement language, no loops/methods/classes):** literals (numeric/string/bool), the
  context API (`ctx.GetTag(lit)` + `.Value`/`.StatusCode`/timestamps, `ctx.SetVirtualTag(lit, expr)`,
  `ctx.Now`, `ctx.Logger`), explicit casts `(int)`/`(double)`/`(bool)`, arithmetic `+ - * / %`,
  comparisons `< > <= >= == !=`, boolean `&& || !`, ternary `?:`, `if/else`, `var` local bindings,
  a fixed allow-listed function set (`Math.*`, `System.Convert.*`, `ScriptContext.Deadband`), and
  `return`.
- **Evaluator:** start with a tree-walking interpreter (lowest memory); optionally compile the AST to
  a `System.Linq.Expressions` delegate (C2-compiled) for hot tags. These are still collectible
  `DynamicMethod`s, ~KB each, far below Roslyn.
- **Hybrid contract:** a classifier parses each script; if it is within the grammar it is interpreted;
  otherwise (LINQ, Regex, anything unrecognised) it takes the Roslyn fallback (today's path). The
  deploy/warn guardrail then counts only the *fallback* scripts.
- **Both consumers:** one engine serves VirtualTag value scripts (return value coerced to `DataType`)
  and ScriptedAlarm predicates (return bool); the only difference is the return-type contract and that
  alarm predicates reject `SetVirtualTag`.
- **Bonus:** interpreted scripts are a *hard* sandbox by construction, so `ScriptSandbox` /
  `ForbiddenTypeAnalyzer` (the curated metadata allow/deny machinery) need only guard the rare Roslyn
  fallback path.
- **Risks (why deferred):** we would own a small language (grammar/parser/semantics/error messages + tests);
  C#-semantic-parity edge cases (int vs float division, overflow, null propagation); and a classifier +
  two engines to maintain. Worth it only at real complex-script scale.
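
To make the tree-walking option concrete, here is a deliberately tiny sketch. It is not the proposed engine: every type name is hypothetical, there is no parser, and all values are treated as `double`, which sidesteps exactly the C#-parity issues (int vs float division, overflow) listed under the risks.

```csharp
using System;
using System.Collections.Generic;

// A few AST node kinds from the surveyed grammar; a real parser would produce these.
public abstract record Expr;
public sealed record Num(double Value) : Expr;
public sealed record TagValue(string TagRef) : Expr;                // ctx.GetTag("ref").Value
public sealed record Binary(char Op, Expr Left, Expr Right) : Expr; // arithmetic or comparison
public sealed record Cond(Expr Test, Expr Then, Expr Else) : Expr;  // test ? then : else

public static class TreeWalker
{
    // Evaluates the tree recursively against the current tag values.
    // The retained per-script state is only the AST itself.
    public static double Eval(Expr node, IReadOnlyDictionary<string, double> tags) => node switch
    {
        Num n => n.Value,
        TagValue t => tags[t.TagRef],
        Binary b => Apply(b.Op, Eval(b.Left, tags), Eval(b.Right, tags)),
        Cond c => Eval(c.Test, tags) != 0 ? Eval(c.Then, tags) : Eval(c.Else, tags),
        _ => throw new NotSupportedException(node.GetType().Name),
    };

    // Comparisons yield 1.0 or 0.0 because this sketch has a single value type.
    private static double Apply(char op, double left, double right) => op switch
    {
        '+' => left + right,
        '-' => left - right,
        '*' => left * right,
        '/' => left / right,
        '%' => left % right,
        '>' => left > right ? 1.0 : 0.0,
        '<' => left < right ? 1.0 : 0.0,
        _ => throw new NotSupportedException(op.ToString()),
    };
}
```

An unsupported node or operator throws here; in the hybrid contract, the classifier would route such a script to the Roslyn fallback before evaluation is ever attempted.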

---

## Verification

1. **Memory probe** (`tools/mem-probe/`, retained as the measurement artifact): a re-run after A0 shows
   the ~11× per-script drop.
2. **Live docker-dev proof (the real acceptance gate):** re-deploy the full **1036-vtag overlay** with
   the A0+A build and confirm the deploy is **Accepted** *and* the central node **stays under its
   `mem_limit`** through materialisation + value streaming (no `OOMKilled`). This is what proves the
   outage is actually gone.
3. Unit tests: passthrough fast-path + warn-guardrail.
4. Existing `Core.Scripting` / `Core.VirtualTags` / `Core.ScriptedAlarms` / Host script+alarm suites
   stay green (A0 is behaviour-preserving).

## Sequencing & risk

| Step | Risk | Notes |
|---|---|---|
| A0 (assembly split) | medium: touches assembly layout across `Core.Scripting`/`VirtualTags`/`ScriptedAlarms` + Host refs | behaviour-preserving; the measured 11× payoff; do first |
| A (passthrough) | low: narrow exact pattern in one evaluator method | additive; covers the mirror overlay |
| guardrail | low: non-blocking log + message | additive |
| C2 | n/a | **deferred**; spec only |

A0 first (it moves types); A + guardrail are additive on top. The Phase-2 spec is documentation only.

## Related context

- `dotnet/roslyn#22219`: the upstream issue (globalsType-closure memory, mostly unmanaged, no fix).
- Measurement harness: `tools/mem-probe/` (this branch).
- Recovery already shipped this session: `docker-dev` `mem_limit 1g→2g` (`master` `89c07fc`) +
  cleared the OOM-causing sealed deployments. The full-validator deploy gate
  (`AdminOperationsActor` + `DraftValidator`) is where the warn-guardrail hooks in.
@@ -0,0 +1,2 @@
*/bin/
*/obj/
@@ -0,0 +1,18 @@
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <TargetFramework>net10.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
    <LangVersion>latest</LangVersion>
    <!-- Throwaway memory probe: keep build noise low. -->
    <TreatWarningsAsErrors>false</TreatWarningsAsErrors>
    <GenerateDocumentationFile>false</GenerateDocumentationFile>
  </PropertyGroup>

  <ItemGroup>
    <!-- Closure of THIS assembly = {LeanContext, Core.Abstractions}. No Roslyn. -->
    <ProjectReference Include="..\..\..\src\Core\ZB.MOM.WW.OtOpcUa.Core.Abstractions\ZB.MOM.WW.OtOpcUa.Core.Abstractions.csproj" />
  </ItemGroup>

</Project>
@@ -0,0 +1,36 @@
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;

namespace LeanContext;

/// <summary>
/// LEAN globals type for the memory probe. Its transitive reference closure is only
/// {LeanContext, Core.Abstractions} (deliberately NO Roslyn), so the per-script cost
/// of the Roslyn reference-manager loading the globalsType's closure (dotnet/roslyn#22219)
/// can be measured against the heavy <c>VirtualTagContext</c>, whose closure pulls in
/// Microsoft.CodeAnalysis.CSharp.Scripting.
/// <para>
/// <see cref="GetTag"/> returns the same <see cref="DataValueSnapshot"/> type that
/// <c>VirtualTagContext.GetTag</c> returns, so the probe's script source
/// (<c>ctx.GetTag("x").Value</c>) is byte-identical for both modes; the ONLY
/// difference is which assembly closure the globalsType lives in.
/// </para>
/// </summary>
public sealed class LeanCtx
{
    private readonly System.Collections.Generic.Dictionary<string, DataValueSnapshot> _d = new();

    public DataValueSnapshot GetTag(string p) =>
        _d.TryGetValue(p, out var v) ? v : new DataValueSnapshot(null, 0u, null, default);
}

/// <summary>
/// LEAN analogue of the prod <c>ScriptGlobals&lt;TContext&gt;</c> wrapper: exposes a
/// named <c>ctx</c> property so the script source can be byte-identical to the heavy
/// path (<c>ctx.GetTag(...).Value</c>). Lives in the LeanContext assembly, so its
/// reference closure is {LeanContext, Core.Abstractions}, with NO Roslyn. This is the A0
/// "globals type in a lean assembly" treatment.
/// </summary>
public sealed class LeanGlobals
{
    public LeanCtx ctx { get; set; } = new();
}
@@ -0,0 +1,23 @@
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net10.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
    <LangVersion>latest</LangVersion>
    <TreatWarningsAsErrors>false</TreatWarningsAsErrors>
    <GenerateDocumentationFile>false</GenerateDocumentationFile>
  </PropertyGroup>

  <ItemGroup>
    <!-- Same Roslyn version the repo pins (Directory.Packages.props => 4.12.0, CPM). -->
    <PackageReference Include="Microsoft.CodeAnalysis.CSharp.Scripting" />
  </ItemGroup>

  <ItemGroup>
    <ProjectReference Include="..\LeanContext\LeanContext.csproj" />
    <ProjectReference Include="..\..\..\src\Core\ZB.MOM.WW.OtOpcUa.Core.VirtualTags\ZB.MOM.WW.OtOpcUa.Core.VirtualTags.csproj" />
  </ItemGroup>

</Project>
@@ -0,0 +1,78 @@
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;

// Memory measurement probe for Roslyn C# scripting per dotnet/roslyn#22219.
// Compiles + RETAINS N distinct scripts (like the prod compiled-delegate cache) and
// measures the per-script working-set cost. The ONLY thing that varies between "heavy"
// and "lean" is the globalsType's assembly closure:
//   heavy = VirtualTagContext (closure pulls in Roslyn via Core.Scripting)
//   lean  = LeanCtx (closure = {LeanContext, Core.Abstractions} only)

static long Rss() => System.Diagnostics.Process.GetCurrentProcess().WorkingSet64;

static void Settle()
{
    for (int i = 0; i < 3; i++)
    {
        GC.Collect(2, GCCollectionMode.Forced, true, true);
        GC.WaitForPendingFinalizers();
    }
}

var mode = args.Length > 0 ? args[0] : "heavy";
int N = args.Length > 1 && int.TryParse(args[1], out var n) ? n : 50;

// The globalsType is what Roslyn's reference manager loads the transitive closure of
// (dotnet/roslyn#22219). We mirror production's choice: prod uses the WRAPPER
// ScriptGlobals<TContext> (which exposes the named `ctx` property), NOT the raw context.
//   heavy = ScriptGlobals<VirtualTagContext> -> wrapper lives in Core.Scripting AND the
//           generic arg VirtualTagContext lives in Core.VirtualTags; both -> Roslyn.
//   lean  = LeanGlobals -> closure {LeanContext, Core.Abstractions},
//           NO Roslyn. This is the A0 "globals type in a lean assembly" treatment.
var globalsType = mode == "lean"
    ? typeof(LeanContext.LeanGlobals)
    : typeof(ZB.MOM.WW.OtOpcUa.Core.Scripting.ScriptGlobals<ZB.MOM.WW.OtOpcUa.Core.VirtualTags.VirtualTagContext>);

// The script reads ctx.GetTag("x").Value. We must reference: the globalsType's own
// assembly, the assemblies of its generic type arguments (so `ctx`'s property type
// resolves), and Core.Abstractions (DataValueSnapshot, the return type of GetTag).
// References are minimal + identical in spirit for both modes; the real difference is
// the transitive UNMANAGED closure of the globalsType's assembly that Roslyn's reference
// manager loads per compilation (the #22219 effect).
var snapshotAssembly = typeof(ZB.MOM.WW.OtOpcUa.Core.Abstractions.DataValueSnapshot).Assembly;
var refAssemblies = new System.Collections.Generic.HashSet<System.Reflection.Assembly>
{
    globalsType.Assembly,
    snapshotAssembly,
};
foreach (var ga in globalsType.GetGenericArguments())
    refAssemblies.Add(ga.Assembly);
var opts = ScriptOptions.Default
    .WithReferences(refAssemblies)
    .WithImports();

// Warm up Roslyn once (compile 1 throwaway) so the baseline excludes one-time Roslyn init.
_ = CSharpScript.Create<object>("return 0;", opts, globalsType).GetCompilation().GetDiagnostics();
Settle();

long baseRss = Rss();
long baseGc = GC.GetTotalMemory(true);

var held = new System.Collections.Generic.List<object>(N);
for (int i = 0; i < N; i++)
{
    var src = $"return ctx.GetTag(\"ref_{i}\").Value;";
    var script = CSharpScript.Create<object>(src, opts, globalsType);
    script.Compile(); // force compilation / emit
    held.Add(script.CreateDelegate()); // retain the compiled delegate (like the prod cache)
}

Settle();
long afterRss = Rss();
long afterGc = GC.GetTotalMemory(true);
GC.KeepAlive(held);

Console.WriteLine($"MODE={mode} N={N}");
Console.WriteLine($" baseline RSS={baseRss / 1048576.0:F1} MiB managed={baseGc / 1048576.0:F1} MiB");
Console.WriteLine($" afterN RSS={afterRss / 1048576.0:F1} MiB managed={afterGc / 1048576.0:F1} MiB");
Console.WriteLine($" PER-SCRIPT: RSS={(afterRss - baseRss) / 1048576.0 / N:F2} MiB/script managed={(afterGc - baseGc) / 1048576.0 / N:F2} MiB/script");