docs(design): VirtualTag/script memory scalability (A0+A+guardrail; C2 deferred) + measurement harness
@@ -0,0 +1,180 @@

# VirtualTag / Script Memory Scalability: Design (2026-06-07)

## Problem

Deploying the Northwind company overlay (1036 `VirtualTag`s, each a one-line mirror script
`return ctx.GetTag("…").Value;`) OOM-killed the central nodes, even with a 4 GiB container limit
on a 15.6 GiB Docker VM. The node materialised the address space and spawned 1036 `VirtualTagActor`s,
then died as their scripts compiled.

### Root cause (measured)

The per-script cost of Roslyn C# scripting is dominated by Roslyn's reference manager materialising
the **transitive reference closure of the assembly that contains the script's `globalsType`**, and
it is almost entirely **unmanaged** memory (a managed-heap snapshot barely moves). This is the
long-standing, unfixed `dotnet/roslyn#22219` ("Backlog"): the reporter measured ~50 MiB/script and
OOM at ~39 scripts; the confirmed mitigation in that thread is "move the globalsType to a lean
assembly."

Our globals type reaches Roslyn: `ScriptGlobals<VirtualTagContext>` → `Core.VirtualTags` →
`Core.Scripting` → `Microsoft.CodeAnalysis.CSharp.Scripting`. So every compile pays for the whole
Roslyn metadata closure.
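
To make the dependency concrete, here is a minimal sketch of how a globals type is handed to the Roslyn scripting API. `DemoContext` and `DemoGlobals<TContext>` are hypothetical stand-ins, not the project's `VirtualTagContext` / `ScriptGlobals<T>`, and the tag name is invented. Only `CSharpScript.Create`, `ScriptOptions` and `CreateDelegate` come from the `Microsoft.CodeAnalysis.CSharp.Scripting` package, which the sketch needs as a NuGet reference.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;

// Hypothetical stand-in for the real script context.
public sealed class DemoContext
{
    public double GetValue(string tagRef) => 42.0;
}

// Hypothetical stand-in for the globals wrapper: scripts see its public members by name.
public sealed class DemoGlobals<TContext>
{
    public DemoGlobals(TContext context) => ctx = context;
    public TContext ctx { get; }
}

public static class GlobalsTypeDemo
{
    public static async Task<object> RunAsync()
    {
        // Per the root-cause analysis above, the assembly that declares the
        // globals type (and what it references) is what each compile pays for.
        var options = ScriptOptions.Default
            .AddReferences(typeof(DemoGlobals<>).Assembly);

        var script = CSharpScript.Create<object>(
            "return ctx.GetValue(\"Line1.Speed\");",
            options,
            globalsType: typeof(DemoGlobals<DemoContext>));

        ScriptRunner<object> runner = script.CreateDelegate();
        return await runner(new DemoGlobals<DemoContext>(new DemoContext()));
    }
}
```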

### Measurement (probe in `tools/mem-probe/`, Roslyn 4.12.0, 50 retained scripts, 5 runs each)

| Globals closure | Per-script RSS | Per-script managed heap |
|---|---|---|
| **Heavy**: `VirtualTagContext` (reaches Roslyn), *today* | **~18.2 MiB** (±25% noise) | 0.18 MiB |
| **Lean**: context in an Abstractions-only assembly (no Roslyn) | **~1.66 MiB** | 0.05 MiB |

→ **~11× reduction, ~99% unmanaged.** Real cost today ≈ **18 MiB/script** → 1036 vtags ≈ **18 GiB**
(explains the instant OOM even at 4 GiB). The earlier "~3.5 MiB" guess was ~5× too low.
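
The arithmetic behind those figures, using only the table values. The last line is a projection that assumes the probe's lean per-script figure carries over to the full overlay with A0 alone, before any fast-path:

$$
\begin{aligned}
\text{reduction} &= \frac{18.2}{1.66} \approx 11.0\times \\
\text{unmanaged share (heavy)} &= \frac{18.2 - 0.18}{18.2} \approx 99\% \\
\text{overlay today} &= 1036 \times 18.2\ \text{MiB} \approx 18\,855\ \text{MiB} \approx 18.4\ \text{GiB} \\
\text{overlay, lean closure} &= 1036 \times 1.66\ \text{MiB} \approx 1\,720\ \text{MiB} \approx 1.68\ \text{GiB}
\end{aligned}
$$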

### Corpus survey (decides the Phase-2 grammar)

Real VirtualTag and ScriptedAlarm scripts use a **small bounded statement grammar**: tag reads
(`ctx.GetTag("lit").Value` / `.StatusCode`), explicit casts, arithmetic `+ - * / %`, comparisons,
boolean logic, ternary/if-else, local variables, a fixed function set (`Math.*`, `System.Convert.*`,
`ScriptContext.Deadband`), and `ctx.SetVirtualTag` (vtag only). The Roslyn sandbox *deliberately*
also allows `System.Linq` (Sum/Average/Where + lambdas) and `System.Text.RegularExpressions`: the
**long tail** that a future interpreter would hand to the Roslyn fallback rather than reimplement.
VirtualTag value scripts and ScriptedAlarm predicate scripts share the grammar (alarm = same
expression → bool, no `SetVirtualTag`).

## Scope decision

- **Phase 1 (build now):** A0 (globals isolation) + A (passthrough fast-path) + a **warn-only**
  deploy guardrail.
- **Phase 2 (spec only, deferred):** C2 interpret-hybrid, documented here and built later only if/when
  thousands of genuinely *complex* (non-passthrough) scripts justify it.
- Both consumers (VirtualTag + ScriptedAlarm) are in scope for A0; A is VirtualTag-passthrough only.

---

## Phase 1 design

### A0: Isolate the globals type from Roslyn (the 11× win)

Extract the **script-callable** types into a new lean assembly
**`ZB.MOM.WW.OtOpcUa.Core.Scripting.Abstractions`** that references **only `Core.Abstractions`
(+ Serilog)** and **never Roslyn**:

- `ScriptContext` (base), `ScriptGlobals<T>` (the globals wrapper), `VirtualTagContext`,
  `AlarmPredicateContext`, and `ScriptContext.Deadband`.
- Leave the Roslyn users (`ScriptSandbox`, `ScriptEvaluator`, `RoslynVirtualTagEvaluator`,
  `RoslynScriptedAlarmEvaluator`, `ForbiddenTypeAnalyzer`, `DependencyExtractor`) in
  `Core.Scripting`; they now reference the lean assembly **downward**, and the lean assembly never
  references the Roslyn-using assembly.
- `Core.VirtualTags` / `Core.ScriptedAlarms` reference the lean assembly for the context types.

Net: the `globalsType`'s transitive closure becomes `{Core.Scripting.Abstractions, Core.Abstractions,
Serilog}`, with **no `Microsoft.CodeAnalysis.*`**. Pure dependency restructuring; **no behaviour change**.

**Structural note (the boundary):** `ScriptContext`/`VirtualTagContext`/`AlarmPredicateContext` must
have no Roslyn-referencing members. Confirm `ScriptContext` doesn't pull in `ScriptEvaluator`/
`ScriptSandbox` types (it shouldn't; those are the *callers*). If any helper on the context needs a
Roslyn type, it stays behind in `Core.Scripting`.

**Test:** the `tools/mem-probe` harness re-run shows per-script RSS in the ~1.66 MiB regime; the full
`Core.Scripting`, `Core.VirtualTags`, `Core.ScriptedAlarms`, and Host-integration script/alarm test
suites stay green (behaviour-preserving).
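
The boundary could additionally be pinned by a unit test so it does not regress silently. The sketch below is one possible shape, not part of this design: `ReferenceClosure` is a hypothetical helper that walks assembly references at runtime. This approximates, but is not identical to, the metadata closure that Roslyn's reference manager resolves.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Reflection;

// Hypothetical helper: collects the names in an assembly's transitive reference closure.
public static class ReferenceClosure
{
    public static HashSet<string> Of(Assembly root)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var pending = new Stack<AssemblyName>();
        pending.Push(root.GetName());

        while (pending.Count > 0)
        {
            var name = pending.Pop();
            // Record the name before loading, so unresolved references still count.
            if (name.Name is null || !seen.Add(name.Name)) continue;

            Assembly loaded;
            try { loaded = Assembly.Load(name); }
            catch (FileNotFoundException) { continue; }
            catch (FileLoadException) { continue; }
            catch (BadImageFormatException) { continue; }

            foreach (var reference in loaded.GetReferencedAssemblies())
                pending.Push(reference);
        }
        return seen;
    }

    // True when any assembly in the closure belongs to Roslyn.
    public static bool ReachesRoslyn(Assembly root) =>
        Of(root).Any(n => n.StartsWith("Microsoft.CodeAnalysis", StringComparison.OrdinalIgnoreCase));
}
```

After A0, a test along the lines of `ReferenceClosure.ReachesRoslyn(typeof(VirtualTagContext).Assembly)` would be expected to return `false`; before A0 it would be expected to return `true`.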

### A: Passthrough fast-path (mirrors skip Roslyn entirely)

In the evaluator (`RoslynVirtualTagEvaluator.Evaluate`), **before** the `_cache.GetOrAdd(expression,
Compile)` lookup, detect the trivial mirror shape and short-circuit:

- Pattern (whitespace-tolerant): `^\s*return\s+ctx\.GetTag\(\s*"([^"]+)"\s*\)\.Value\s*;\s*$`.
- On match: return `dependencies[ref]` directly (the value is already passed into `Evaluate`), with
  **no Roslyn, no cache entry, and a memory cost of ~bytes.** Map a missing dep to the same
  `BadNodeIdUnknown`/no-change result the Roslyn path would produce.
- Non-matches fall through to the existing Roslyn cache path unchanged.
- Downstream `DataType` coercion (in the actor) is unchanged: the fast-path returns the same raw
  value the compiled script would have returned.

This covers 100% of the mirror overlay (all 1036 scripts). Keep it a narrow, exact pattern so a
near-miss safely falls through to Roslyn rather than mis-evaluating.
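
A minimal sketch of the classifier, using the pattern from the list above. `PassthroughShape` and `TryMatch` are hypothetical names. The backslash check is an extra assumption, not part of this design: the regex captures raw source text, so a literal containing a C# escape sequence would not equal the runtime string, and sending it to Roslyn is the safer near-miss behaviour.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: recognises the one-line mirror script shape.
public static class PassthroughShape
{
    private static readonly Regex Mirror = new Regex(
        @"^\s*return\s+ctx\.GetTag\(\s*""([^""]+)""\s*\)\.Value\s*;\s*$",
        RegexOptions.CultureInvariant);

    public static bool TryMatch(string source, out string tagRef)
    {
        tagRef = string.Empty;

        var match = Mirror.Match(source);
        if (!match.Success) return false;

        var literal = match.Groups[1].Value;

        // Assumption: reject escape sequences instead of decoding them,
        // so the script falls through to the Roslyn path.
        if (literal.IndexOf('\\') >= 0) return false;

        tagRef = literal;
        return true;
    }
}
```

The evaluator would call `TryMatch` first and look up the captured reference in its dependency map only on success; every other script takes the existing cache path.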

**Test:** passthrough returns the dep value with zero compilation (assert the cache stays empty);
a non-passthrough still compiles + works; a malformed near-match (`...Value + 1;`) falls through.

### Warn-only guardrail

In `AdminOperationsActor.HandleStartDeploymentAsync`, **after** the existing `DraftValidator` gate and
**non-blocking**:

- `compiled = count(unique script sources that are NOT the passthrough shape)` (from the snapshot's
  `Script` rows; reuse the A pattern to classify).
- `estMiB ≈ compiled × perScriptMiB` (configurable; default ~1.66, the measured post-A0 cost).
- Emit a structured `_log.Warning("StartDeployment: {Compiled} scripts will compile (~{EstMiB} MiB
  RSS/node); ensure node mem_limit covers it", …)` and append an advisory line to the **`Accepted`**
  `StartDeploymentResult.Message`.
- **Never rejects.** Operator-visible, operator-decided.
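
The estimate in the first two bullets can be sketched as below. `DeployMemoryEstimate` is a hypothetical name, and the passthrough classifier is passed in as a delegate so the sketch stays self-contained. Ordinal de-duplication is an assumption that mirrors a compile cache keyed by the script source.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: estimates the RSS cost of the scripts that will actually compile.
public static class DeployMemoryEstimate
{
    public static (int Compiled, double EstMiB) Estimate(
        IEnumerable<string> scriptSources,
        Func<string, bool> isPassthrough,
        double perScriptMiB = 1.66) // measured post-A0 cost; configurable
    {
        // Identical sources share one compiled entry, so count each distinct source once.
        var compiled = scriptSources
            .Where(source => !isPassthrough(source))
            .Distinct(StringComparer.Ordinal)
            .Count();

        return (compiled, compiled * perScriptMiB);
    }
}
```

The handler would only log and annotate the result with these two numbers; it never uses them to reject the deployment.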

**Test:** a config with many distinct non-passthrough scripts logs the warning + still returns
`Accepted`; a passthrough-only config logs ~0 compiled.

---

## Phase 2 design (deferred, spec only)

### C2: Interpret-hybrid (built later, if thousands of *complex* scripts appear)

A bounded interpreter that replaces Roslyn for the surveyed grammar, with Roslyn retained as the
fallback for the long tail. Memory per interpreted script is on the order of KB (a parsed AST), versus
~1.66 MiB (post-A0) for a Roslyn-compiled one, so the approach scales to tens of thousands of complex
scripts.

- **Grammar (statement language, no loops/methods/classes):** literals (numeric/string/bool), the
  context API (`ctx.GetTag(lit)` + `.Value`/`.StatusCode`/timestamps, `ctx.SetVirtualTag(lit, expr)`,
  `ctx.Now`, `ctx.Logger`), explicit casts `(int)`/`(double)`/`(bool)`, arithmetic `+ - * / %`,
  comparisons `< > <= >= == !=`, boolean `&& || !`, ternary `?:`, `if/else`, `var` local bindings,
  a fixed allow-listed function set (`Math.*`, `System.Convert.*`, `ScriptContext.Deadband`), and
  `return`.
- **Evaluator:** start with a tree-walking interpreter (lowest memory); optionally compile the AST to
  a `System.Linq.Expressions` delegate (C2-compiled) for hot tags. These are still collectible
  `DynamicMethod`s at ~KB each, far below Roslyn.
- **Hybrid contract:** a classifier parses each script; if it's within the grammar → interpret; else
  (LINQ, Regex, anything unrecognised) → Roslyn fallback (today's path). The deploy/warn guardrail
  then counts only the *fallback* scripts.
- **Both consumers:** one engine serves VirtualTag value scripts (return value coerced to `DataType`)
  and ScriptedAlarm predicates (return bool); the only difference is the return-type contract and that
  alarm predicates reject `SetVirtualTag`.
- **Bonus:** interpreted scripts are a *hard* sandbox by construction, so `ScriptSandbox` /
  `ForbiddenTypeAnalyzer` (the curated metadata allow/deny machinery) only need to guard the rare
  Roslyn fallback path.
- **Risks (why deferred):** you own a small language (grammar/parser/semantics/error messages + tests);
  C#-semantic-parity edge cases (int vs float division, overflow, null propagation); a classifier +
  two engines to maintain. Worth it only at real complex-script scale.
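
To give a feel for the tree-walking option, here is a deliberately tiny sketch. Every type name is hypothetical, only a few operators are covered, and all values are treated as `double`. That simplification ignores exactly the C#-parity concerns listed under Risks (for example integer division), so this illustrates the technique and is not a proposed implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical AST for a small slice of the surveyed grammar.
public abstract record Expr;
public sealed record Num(double Value) : Expr;
public sealed record TagValue(string TagRef) : Expr;
public sealed record Binary(char Op, Expr Left, Expr Right) : Expr;
public sealed record Conditional(Expr Test, Expr Then, Expr Else) : Expr;

public static class TreeWalker
{
    // Recursively evaluates a node against the already-resolved tag values.
    public static double Eval(Expr node, IReadOnlyDictionary<string, double> tags) => node switch
    {
        Num n => n.Value,
        TagValue t => tags[t.TagRef],
        Binary b => Apply(b.Op, Eval(b.Left, tags), Eval(b.Right, tags)),
        // Only the selected branch is evaluated, as with the C# ternary.
        Conditional c => Eval(c.Test, tags) != 0 ? Eval(c.Then, tags) : Eval(c.Else, tags),
        _ => throw new NotSupportedException(node.GetType().Name),
    };

    // Comparisons yield 1 or 0 because this sketch has a single value type.
    private static double Apply(char op, double left, double right) => op switch
    {
        '+' => left + right,
        '-' => left - right,
        '*' => left * right,
        '/' => left / right,
        '%' => left % right,
        '>' => left > right ? 1 : 0,
        '<' => left < right ? 1 : 0,
        _ => throw new NotSupportedException($"operator '{op}'"),
    };
}
```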

---

## Verification

1. **Memory probe** (`tools/mem-probe/`, retained as the measurement artifact): re-run after A0 shows
   the ~11× per-script drop.
2. **Live docker-dev proof (the real acceptance gate):** re-deploy the full **1036-vtag overlay** with
   the A0+A build and confirm the deploy is **Accepted** *and* the central node **stays under its
   `mem_limit`** through materialisation + value streaming (no `OOMKilled`). This is what proves the
   outage is actually gone.
3. Unit tests: passthrough fast-path + warn-guardrail.
4. Existing `Core.Scripting` / `Core.VirtualTags` / `Core.ScriptedAlarms` / Host script+alarm suites
   stay green (A0 is behaviour-preserving).

## Sequencing & risk

| Step | Risk | Notes |
|---|---|---|
| A0 (assembly split) | medium: touches assembly layout across `Core.Scripting`/`VirtualTags`/`ScriptedAlarms` + Host refs | behaviour-preserving; the measured 11× payoff; do first |
| A (passthrough) | low: narrow exact pattern in one evaluator method | additive; covers the mirror overlay |
| guardrail | low: non-blocking log + message | additive |
| C2 | n/a | **deferred**; spec only |

A0 first (it moves types); A + guardrail are additive on top. The Phase-2 spec is documentation only.

## Related context

- `dotnet/roslyn#22219`: the upstream issue (globalsType-closure memory, mostly unmanaged, no fix).
- Measurement harness: `tools/mem-probe/` (this branch).
- Recovery already shipped this session: `docker-dev` `mem_limit 1g→2g` (`master` `89c07fc`) +
  cleared the OOM-causing sealed deployments. The full-validator deploy gate
  (`AdminOperationsActor` + `DraftValidator`) is where the warn-guardrail hooks in.