docs(design): VirtualTag/script memory scalability (A0+A+guardrail; C2 deferred) + measurement harness

Joseph Doherty
2026-06-07 14:55:28 -04:00
parent 89c07fc382
commit 321d57938f
6 changed files with 337 additions and 0 deletions
@@ -0,0 +1,180 @@
# VirtualTag / Script Memory Scalability — Design (2026-06-07)
## Problem
Deploying the Northwind company overlay (1036 `VirtualTag`s, each a one-line mirror script
`return ctx.GetTag("…").Value;`) OOM-killed the central nodes — even with a 4 GiB container limit,
on a 15.6 GiB Docker VM. The node materialised the address space and spawned 1036 `VirtualTagActor`s,
then died as their scripts compiled.
### Root cause (measured)
The per-script cost of Roslyn C# scripting is dominated by Roslyn's reference manager materialising
the **transitive reference closure of the assembly that contains the script's `globalsType`** — and
it is almost entirely **unmanaged** memory (a managed-heap snapshot barely moves). This is the
long-standing, unfixed `dotnet/roslyn#22219` ("Backlog"): the reporter measured ~50 MiB/script and
OOM at ~39 scripts; the confirmed mitigation in that thread is "move the globalsType to a lean
assembly."
Our globals type reaches Roslyn: `ScriptGlobals<VirtualTagContext>` → `Core.VirtualTags` →
`Core.Scripting` → `Microsoft.CodeAnalysis.CSharp.Scripting`. So every compile pays for the whole
Roslyn metadata closure.
### Measurement (probe in `tools/mem-probe/`, Roslyn 4.12.0, 50 retained scripts, 5 runs each)
| globals closure | per-script RSS | per-script managed |
|---|---|---|
| **Heavy**: `VirtualTagContext` (reaches Roslyn), *today* | **~18.2 MiB** (±25% noise) | 0.18 MiB |
| **Lean**: context in an Abstractions-only assembly (no Roslyn) | **~1.66 MiB** | 0.05 MiB |

**~11× reduction, ~99% unmanaged.** Real cost today ≈ **18 MiB/script**, so 1036 vtags ≈ **18 GiB**
(which explains the instant OOM even at 4 GiB). The earlier "~3.5 MiB" guess was ~5× too low.
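
The fleet-level figures follow directly from the table. A minimal sketch of the arithmetic, using only the measured per-script numbers above (variable names are illustrative, and 1 GiB is taken as 1024 MiB):

```csharp
using System;

// Measured per-script RSS from the table above, in MiB.
const double heavyMiB = 18.2;  // globals type reaches Roslyn (today)
const double leanMiB = 1.66;   // globals type in a Roslyn-free assembly
const int scripts = 1036;      // VirtualTags in the overlay

// Projected RSS if every script is compiled by Roslyn.
double heavyGiB = scripts * heavyMiB / 1024.0;
double leanGiB = scripts * leanMiB / 1024.0;

Console.WriteLine($"heavy closure: {heavyGiB:F1} GiB");
Console.WriteLine($"lean closure:  {leanGiB:F1} GiB");
Console.WriteLine($"reduction:     {heavyMiB / leanMiB:F1}x");
```

This works out to roughly 18.4 GiB against roughly 1.7 GiB for the same 1036 scripts, which is why isolating the globals type alone brings the overlay back under a 4 GiB limit on paper.
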
### Corpus survey (decides the Phase-2 grammar)
Real VirtualTag and ScriptedAlarm scripts use a **small bounded statement grammar**: tag reads
(`ctx.GetTag("lit").Value` / `.StatusCode`), explicit casts, arithmetic `+ - * / %`, comparisons,
boolean logic, ternary/if-else, local variables, a fixed function set (`Math.*`, `System.Convert.*`,
`ScriptContext.Deadband`), and `ctx.SetVirtualTag` (vtag only). The Roslyn sandbox *deliberately*
also allows `System.Linq` (Sum/Average/Where + lambdas) and `System.Text.RegularExpressions` — the
**long tail** that a future interpreter would Roslyn-fallback rather than reimplement. VirtualTag
value scripts and ScriptedAlarm predicate scripts share the grammar (alarm = same expression → bool,
no `SetVirtualTag`).
## Scope decision
- **Phase 1 (build now):** A0 (globals isolation) + A (passthrough fast-path) + a **warn-only**
deploy guardrail.
- **Phase 2 (spec only, deferred):** C2 interpret-hybrid — documented here, built later only if/when
thousands of genuinely *complex* (non-passthrough) scripts justify it.
- Both consumers (VirtualTag + ScriptedAlarm) are in scope for A0; A is VirtualTag-passthrough only.
---
## Phase 1 design
### A0 — Isolate the globals type from Roslyn (the 11× win)
Extract the **script-callable** types into a new lean assembly
**`ZB.MOM.WW.OtOpcUa.Core.Scripting.Abstractions`** that references **only `Core.Abstractions`
(+ Serilog)** and **never Roslyn**:
- `ScriptContext` (base), `ScriptGlobals<T>` (the globals wrapper), `VirtualTagContext`,
`AlarmPredicateContext`, and `ScriptContext.Deadband`.
- Leave the Roslyn users (`ScriptSandbox`, `ScriptEvaluator`, `RoslynVirtualTagEvaluator`,
`RoslynScriptedAlarmEvaluator`, `ForbiddenTypeAnalyzer`, `DependencyExtractor`) in
`Core.Scripting`; they now reference the lean assembly **downward**, and the lean assembly never
references the Roslyn assembly.
- `Core.VirtualTags` / `Core.ScriptedAlarms` reference the lean assembly for the context types.

Net: the `globalsType`'s transitive closure becomes `{Core.Scripting.Abstractions, Core.Abstractions,
Serilog}`, with **no `Microsoft.CodeAnalysis.*`**. Pure dependency restructuring; **no behaviour change**.

**Structural note (the boundary):** `ScriptContext`/`VirtualTagContext`/`AlarmPredicateContext` must
have no Roslyn-referencing members. Confirm `ScriptContext` doesn't pull in `ScriptEvaluator`/
`ScriptSandbox` types (it shouldn't; those are the *callers*). If any helper on the context needs a
Roslyn type, it stays behind in `Core.Scripting`.

**Test:** the `tools/mem-probe` harness re-run shows per-script RSS in the ~1.66 MiB regime; the full
`Core.Scripting`, `Core.VirtualTags`, `Core.ScriptedAlarms`, and Host-integration script/alarm test
suites stay green (behaviour-preserving).
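
The boundary could also be pinned by a cheap regression check that walks the reference closure of the assembly holding the context types and fails if anything under `Microsoft.CodeAnalysis` is reachable. The sketch below is an assumption-laden illustration, not part of the probe: `ReferenceClosure` is a hypothetical helper, it uses the runtime loader (`Assembly.GetReferencedAssemblies` plus `Assembly.Load`) rather than Roslyn's own reference manager, and it uses `System.Linq`'s assembly as a stand-in root where a real test would start from `typeof(VirtualTagContext).Assembly`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Reflection;

// Stand-in root; the real check would start from the lean assembly.
ISet<string> closure = ReferenceClosure.Of(typeof(Enumerable).Assembly);
bool reachesRoslyn = closure.Any(
    n => n.StartsWith("Microsoft.CodeAnalysis", StringComparison.Ordinal));
Console.WriteLine(reachesRoslyn ? "closure reaches Roslyn" : "closure is Roslyn-free");

static class ReferenceClosure
{
    // Breadth-first walk over assembly references starting at `root`.
    // Returns the simple names of every assembly reachable from it.
    public static ISet<string> Of(Assembly root)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var queue = new Queue<Assembly>();
        seen.Add(root.GetName().Name ?? string.Empty);
        queue.Enqueue(root);

        while (queue.Count > 0)
        {
            foreach (AssemblyName reference in queue.Dequeue().GetReferencedAssemblies())
            {
                // Skip nameless references and anything already visited.
                if (reference.Name is null || !seen.Add(reference.Name)) continue;
                try
                {
                    queue.Enqueue(Assembly.Load(reference));
                }
                catch (Exception ex) when (ex is FileNotFoundException
                                               or FileLoadException
                                               or BadImageFormatException)
                {
                    // Not loadable here: the name is recorded but not walked further.
                }
            }
        }
        return seen;
    }
}
```

Because the walk loads assemblies into the current process, it fits better in a dedicated test project than in production code.
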
### A — Passthrough fast-path (mirrors skip Roslyn entirely)
In the evaluator (`RoslynVirtualTagEvaluator.Evaluate`), **before** the `_cache.GetOrAdd(expression,
Compile)` lookup, detect the trivial mirror shape and short-circuit:
- Pattern (whitespace-tolerant): `^\s*return\s+ctx\.GetTag\(\s*"([^"]+)"\s*\)\.Value\s*;\s*$`.
- On match: return `dependencies[ref]` directly (the value is already passed into `Evaluate`) — **no
Roslyn, no cache entry, ~bytes.** Map a missing dep to the same `BadNodeIdUnknown`/no-change result
the Roslyn path would produce.
- Non-matches fall through to the existing Roslyn cache path unchanged.
- Downstream `DataType` coercion (in the actor) is unchanged — the fast-path returns the same raw
value the compiled script would have returned.

Covers 100% of the mirror overlay (the 1036). Keep the pattern narrow and exact so that a near-miss
safely falls through to Roslyn rather than mis-evaluating.

**Test:** passthrough returns the dep value with zero compilation (assert the cache stays empty);
a non-passthrough still compiles + works; a malformed near-match (`...Value + 1;`) falls through.
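
A minimal sketch of the fast-path check, using the pattern given above. The real signature of `Evaluate` and the type of `dependencies` are not shown in this document, so `Passthrough`, `PassthroughOutcome`, and the `IReadOnlyDictionary<string, object?>` parameter are all hypothetical stand-ins; the caller is assumed to map `MissingDependency` onto whatever the Roslyn path returns today.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var deps = new Dictionary<string, object?> { ["Line1/Speed"] = 42.5 };

// Exact mirror: answered from the dependency map, nothing is compiled.
var hit = Passthrough.TryEvaluate(
    "return ctx.GetTag(\"Line1/Speed\").Value;", deps, out var value);
Console.WriteLine($"{hit}: {value}");

// Near-miss: must be handed to the existing Roslyn cache path.
var miss = Passthrough.TryEvaluate(
    "return ctx.GetTag(\"Line1/Speed\").Value + 1;", deps, out _);
Console.WriteLine(miss);

enum PassthroughOutcome { NotPassthrough, Value, MissingDependency }

static class Passthrough
{
    // The whitespace-tolerant mirror shape from the design, anchored at both ends.
    private static readonly Regex Shape = new(
        @"^\s*return\s+ctx\.GetTag\(\s*""([^""]+)""\s*\)\.Value\s*;\s*$",
        RegexOptions.Compiled | RegexOptions.CultureInvariant);

    public static PassthroughOutcome TryEvaluate(
        string expression,
        IReadOnlyDictionary<string, object?> dependencies,
        out object? value)
    {
        value = null;
        Match m = Shape.Match(expression);
        if (!m.Success) return PassthroughOutcome.NotPassthrough;

        // Group 1 is the raw text between the quotes, used as the lookup key.
        return dependencies.TryGetValue(m.Groups[1].Value, out value)
            ? PassthroughOutcome.Value
            : PassthroughOutcome.MissingDependency;
    }
}
```

One thing the sketch makes visible: the captured key is raw source text, so a literal containing backslash escapes would be looked up unescaped. If tag paths can contain backslashes (this document does not say), excluding `\` from the character class would send those scripts to Roslyn instead.
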
### Warn-only guardrail
In `AdminOperationsActor.HandleStartDeploymentAsync`, **after** the existing `DraftValidator` gate and
**non-blocking**:
- `compiled = count(unique script sources that are NOT the passthrough shape)` (from the snapshot's
`Script` rows; reuse the A pattern to classify).
- `estMiB ≈ compiled × perScriptMiB` (configurable, default ~1.66 — the measured post-A0 cost).
- Emit a structured `_log.Warning("StartDeployment: {Compiled} scripts will compile (~{EstMiB} MiB
RSS/node); ensure node mem_limit covers it", …)` and append an advisory line to the **`Accepted`**
`StartDeploymentResult.Message`.
- **Never rejects.** Operator-visible, operator-decided.

**Test:** a config with many distinct non-passthrough scripts logs the warning + still returns
`Accepted`; a passthrough-only config logs ~0 compiled.
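
The estimate itself is a few lines. The sketch below assumes the script sources are available as plain strings and counts uniqueness by exact source text, since the evaluator cache described above is keyed by the expression. `Guardrail` and `Estimate` are hypothetical names, and the logging and `StartDeploymentResult` wiring are left out. With the sample input (two mirrors, one complex script submitted twice, one other complex script) it counts 2 compiled scripts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

var sources = new[]
{
    "return ctx.GetTag(\"A\").Value;",                            // mirror: skipped
    "return ctx.GetTag(\"B\").Value;",                            // mirror: skipped
    "return (double)ctx.GetTag(\"A\").Value * 2;",                // compiled
    "return (double)ctx.GetTag(\"A\").Value * 2;",                // duplicate: counted once
    "return (double)ctx.GetTag(\"B\").Value > 10 ? 1 : 0;",       // compiled
};

var (compiled, estMiB) = Guardrail.Estimate(sources, perScriptMiB: 1.66);
Console.WriteLine($"{compiled} scripts will compile (~{estMiB:F2} MiB RSS/node)");

static class Guardrail
{
    // Same shape as the passthrough fast-path, reused here only to classify.
    private static readonly Regex PassthroughShape = new(
        @"^\s*return\s+ctx\.GetTag\(\s*""([^""]+)""\s*\)\.Value\s*;\s*$",
        RegexOptions.Compiled | RegexOptions.CultureInvariant);

    // Counts unique non-passthrough sources and scales by the per-script cost.
    public static (int Compiled, double EstMiB) Estimate(
        IEnumerable<string> scriptSources, double perScriptMiB)
    {
        int compiled = scriptSources
            .Distinct(StringComparer.Ordinal)
            .Count(source => !PassthroughShape.IsMatch(source));
        return (compiled, compiled * perScriptMiB);
    }
}
```
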
---
## Phase 2 design (deferred — spec only)
### C2 — Interpret-hybrid (built later, if thousands of *complex* scripts appear)
A bounded interpreter that replaces Roslyn for the surveyed grammar, with Roslyn retained as the
fallback for the long tail. Memory per interpreted script ≈ KB (a parsed AST), vs ~1.66 MiB (post-A0)
for a Roslyn-compiled one; scales to tens of thousands of complex scripts.
- **Grammar (statement language, no loops/methods/classes):** literals (numeric/string/bool), the
context API (`ctx.GetTag(lit)` + `.Value`/`.StatusCode`/timestamps, `ctx.SetVirtualTag(lit, expr)`,
`ctx.Now`, `ctx.Logger`), explicit casts `(int)`/`(double)`/`(bool)`, arithmetic `+ - * / %`,
comparisons `< > <= >= == !=`, boolean `&& || !`, ternary `?:`, `if/else`, `var` local bindings,
a fixed allow-listed function set (`Math.*`, `System.Convert.*`, `ScriptContext.Deadband`), and
`return`.
- **Evaluator:** start with a tree-walking interpreter (lowest memory); optionally compile the AST to
a `System.Linq.Expressions` delegate (C2-compiled) for hot tags — still collectible `DynamicMethod`s,
~KB, far below Roslyn.
- **Hybrid contract:** a classifier parses each script; if it's within the grammar → interpret; else
(LINQ, Regex, anything unrecognised) → Roslyn fallback (today's path). The deploy/warn guardrail
then counts only the *fallback* scripts.
- **Both consumers:** one engine serves VirtualTag value scripts (return value coerced to `DataType`)
and ScriptedAlarm predicates (return bool); the only difference is the return-type contract and that
alarm predicates reject `SetVirtualTag`.
- **Bonus:** interpreted scripts are a *hard* sandbox by construction — `ScriptSandbox` /
`ForbiddenTypeAnalyzer` (the curated metadata allow/deny machinery) only need guard the rare Roslyn
fallback path.
- **Risks (why deferred):** you own a small language (grammar/parser/semantics/error messages + tests);
C#-semantic-parity edge cases (int vs float division, overflow, null propagation); a classifier +
two engines to maintain. Worth it only at real complex-script scale.
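
To make the memory claim concrete, here is a minimal sketch of what a tree-walking evaluator over a parsed AST could look like. It is illustrative only and rests on several assumptions: every type and member name is hypothetical, the AST is built by hand because the parser and classifier are omitted, only a few operators are covered, and all values are `double`, which deliberately sidesteps the int-versus-float parity risk named above. The hand-built tree stands for `(A + B) / 2 > 10 ? 1 : 0` with tag reads in place of `A` and `B`.

```csharp
using System;
using System.Collections.Generic;

// (ctx.GetTag("A").Value + ctx.GetTag("B").Value) / 2 > 10 ? 1 : 0
Expr script = new Ternary(
    new Binary(">",
        new Binary("/",
            new Binary("+", new TagValue("A"), new TagValue("B")),
            new Num(2)),
        new Num(10)),
    new Num(1),
    new Num(0));

var tags = new Dictionary<string, double> { ["A"] = 8, ["B"] = 16 };
Console.WriteLine(Interpreter.Eval(script, tags));

// The retained per-script state is just this small immutable tree.
abstract record Expr;
sealed record Num(double Value) : Expr;
sealed record TagValue(string Path) : Expr;
sealed record Binary(string Op, Expr Left, Expr Right) : Expr;
sealed record Ternary(Expr Condition, Expr Then, Expr Else) : Expr;

static class Interpreter
{
    // Recursively evaluates a node; booleans are represented as 1 or 0.
    public static double Eval(Expr e, IReadOnlyDictionary<string, double> tags) => e switch
    {
        Num n => n.Value,
        TagValue t => tags[t.Path],
        Binary b => Apply(b.Op, Eval(b.Left, tags), Eval(b.Right, tags)),
        Ternary c => Eval(c.Condition, tags) != 0 ? Eval(c.Then, tags) : Eval(c.Else, tags),
        _ => throw new NotSupportedException($"node {e.GetType().Name}"),
    };

    private static double Apply(string op, double left, double right) => op switch
    {
        "+" => left + right,
        "-" => left - right,
        "*" => left * right,
        "/" => left / right,
        "%" => left % right,
        ">" => left > right ? 1.0 : 0.0,
        "<" => left < right ? 1.0 : 0.0,
        _ => throw new NotSupportedException($"operator {op}"),
    };
}
```
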
---
## Verification
1. **Memory probe** (`tools/mem-probe/`, retained as the measurement artifact): re-run after A0 shows
the ~11× per-script drop.
2. **Live docker-dev proof (the real acceptance gate):** re-deploy the full **1036-vtag overlay** with
the A0+A build and confirm the deploy is **Accepted** *and* the central node **stays under its
`mem_limit`** through materialisation + value streaming (no `OOMKilled`). This is what proves the
outage is actually gone.
3. Unit tests: passthrough fast-path + warn-guardrail.
4. Existing `Core.Scripting` / `Core.VirtualTags` / `Core.ScriptedAlarms` / Host script+alarm suites
stay green (A0 is behaviour-preserving).
## Sequencing & risk
| Step | Risk | Notes |
|---|---|---|
| A0 (assembly split) | medium — touches assembly layout across `Core.Scripting`/`VirtualTags`/`ScriptedAlarms` + Host refs | behaviour-preserving; the measured 11× payoff; do first |
| A (passthrough) | low — narrow exact pattern in one evaluator method | additive; covers the mirror overlay |
| guardrail | low — non-blocking log + message | additive |
| C2 | — | **deferred**; spec only |

A0 goes first because it moves types; A and the guardrail are additive on top. The Phase-2 spec is
documentation only.
## Related context
- `dotnet/roslyn#22219` — the upstream issue (globalsType-closure memory, mostly unmanaged, no fix).
- Measurement harness: `tools/mem-probe/` (this branch).
- Recovery already shipped this session: `docker-dev` `mem_limit 1g→2g` (`master` `89c07fc`) +
cleared the OOM-causing sealed deployments. The full-validator deploy gate
(`AdminOperationsActor` + `DraftValidator`) is where the warn-guardrail hooks in.