Files
lmxopcua/code-reviews/Core.Scripting/findings.md
Joseph Doherty 7b6ab2ec6f fix(scripting): unload compiled-script assemblies via collectible ALC
Core.Scripting-008 resolution: replace the legacy CSharpScript.CreateDelegate
path with hand-rolled CSharpCompilation + Emit + collectible AssemblyLoadContext,
so per-publish compile accretion no longer requires a server restart to reclaim.

Why this was needed:
  Roslyn's CSharpScript path emits dynamically-compiled script assemblies into
  the default AssemblyLoadContext, which is non-collectible. Across config-
  publish generations each Clear() drops dictionary entries but the emitted
  assemblies stay loaded for process lifetime, so memory grows steadily on
  long-running servers with frequent publishes. The accepted-limitation note
  in docs/VirtualTags.md recommended scheduled restarts as the workaround;
  operator feedback was that restarts are difficult, so the underlying
  limitation was the right thing to fix.

Implementation:
  - New ScriptAssemblyLoadContext(name, isCollectible: true) hosts one emitted
    script assembly per evaluator.
  - ScriptEvaluator.Compile synthesises a wrapper class around the user source
    (CompiledScript.Run(globals) — explicit return required per ordinary C#
    semantics, which every existing script already uses), builds a
    CSharpCompilation against the sandbox references, runs the
    ForbiddenTypeAnalyzer over the semantic model unchanged, emits to an
    in-memory PE stream, loads via ScriptAssemblyLoadContext.LoadFromStream,
    and binds a strongly-typed Func<ScriptGlobals<TContext>, TResult> delegate
    via reflection.
  - ScriptEvaluator now implements IDisposable — Dispose calls
    AssemblyLoadContext.Unload(), which makes the emitted assembly eligible
    for GC at the next collection cycle.
  - CompiledScriptCache.Clear() disposes every materialised evaluator before
    dropping its dictionary entry; CompiledScriptCache itself is now
    IDisposable for graceful server shutdown.
  - ScriptSandbox.Build returns a new SandboxConfig (References + Imports)
    instead of a Roslyn ScriptOptions; references now span BCL via the
    TRUSTED_PLATFORM_ASSEMBLIES set filtered to System.* + netstandard +
    Microsoft.Win32.Registry, so forbidden BCL types resolve at compile and
    ForbiddenTypeAnalyzer is the sole security gate (consistent with the
    Core.Scripting-001 / -002 model — references-list-only restriction is
    porous against type forwarding, so the analyzer must be the real gate).

Verification:
  - All 104 Core.Scripting tests pass (was 101 — three new regression tests
    locking the unload contract).
  - All 56 VirtualTags tests pass (unchanged).
  - All 63 ScriptedAlarms tests pass (unchanged).
  - New CompiledScriptCacheTests:
    - Dispose_unloads_compiled_script_assembly_load_context — proves single-
      evaluator ALC unload via WeakReference + bounded GC.Collect() loop.
    - Clear_disposes_every_materialised_evaluator — proves publish-replace
      releases every prior generation's ALC.
    - GetOrCompile_after_Dispose_throws_ObjectDisposedException — locks the
      post-dispose contract.

Docs:
  - docs/VirtualTags.md "Compile cache" section rewritten: the accepted-
    limitation note replaced with the unload contract + the new authoring
    convention (explicit return).
  - docs/ScriptedAlarms.md cross-reference updated to drop the obsolete
    restart guidance.
  - code-reviews/Core.Scripting/findings.md Core.Scripting-008 flipped
    Won't Fix → Resolved with the implementation summary.
  - code-reviews/README.md regenerated.

Pre-existing breakage note: Driver.Galaxy fails the solution-wide build on
master because its ProjectReference to the sibling mxaccessgw repo's
MxGateway.Client targets a path that the sibling repo no longer has after a
recent restructuring. This is unrelated to Core.Scripting-008 and was
verified to exist on master before this branch was cut.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 15:55:04 -04:00

22 KiB

Code Review — Core.Scripting

Field Value
Module src/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting
Reviewer Claude Code
Review date 2026-05-22
Commit reviewed 76d35d1
Status Reviewed
Open findings 0

Checklist coverage

A comprehensive review completes every category, recording "No issues found" where a category produced nothing rather than leaving it blank.

# Category Result
1 Correctness & logic bugs Core.Scripting-004, Core.Scripting-005
2 OtOpcUa conventions No issues found
3 Concurrency & thread safety Core.Scripting-006
4 Error handling & resilience Core.Scripting-007
5 Security Core.Scripting-001, Core.Scripting-002, Core.Scripting-003
6 Performance & resource management Core.Scripting-008
7 Design-document adherence Core.Scripting-009
8 Code organization & conventions No issues found
9 Testing coverage Core.Scripting-010, Core.Scripting-011
10 Documentation & comments No issues found

Findings

Core.Scripting-001

Field Value
Severity Critical
Category Security
Location ForbiddenTypeAnalyzer.cs:45, ScriptSandbox.cs:54
Status Resolved

Description: System.Environment lives in the allowed System namespace (it is in System.Private.CoreLib, which is allow-listed for primitives) and is not on the forbidden-namespace deny-list. Nothing prevents an operator-authored script from calling System.Environment.Exit(0) or System.Environment.FailFast("..."). Both terminate the host process immediately. Because scripted-alarm predicates and virtual-tag scripts run in-process in the main OPC UA server (decision: "Scripting engine runs in the main .NET 10 server process"), a single malicious or buggy predicate brings down the entire server — an outage affecting every connected client and every driver. ScriptSandboxTests only pins the read path (Environment.GetEnvironmentVariable) as an accepted compromise; the process-killing members are not considered. The whole-process kill far exceeds the "read-only process state" justification the test comments rely on.

Recommendation: The forbidden surface must be member-granular, not namespace-granular, for types in allowed namespaces. Add an explicit forbidden-member deny-list to ForbiddenTypeAnalyzer covering at minimum System.Environment.Exit, System.Environment.FailFast, System.AppDomain, System.GC (e.g. GC.Collect, GC.AddMemoryPressure), and System.Activator.CreateInstance (a reflection-equivalent escape). Reject these in CheckSymbol by resolved method symbol, with a test for each.

Resolution: Resolved 2026-05-22 — added a type-granular ForbiddenFullTypeNames deny-list (System.Environment, System.AppDomain, System.GC, System.Activator) to ForbiddenTypeAnalyzer; CheckSymbol now rejects any resolved type symbol whose fully-qualified name matches, alongside the existing namespace-prefix check, so dangerous System-namespace process-control types are blocked at compile while legitimate System types (Math, String, …) stay usable. Regression tests added in ScriptSandboxTests for Environment.Exit / Environment.FailFast / Environment.GetEnvironmentVariable / AppDomain / GC.Collect / Activator.CreateInstance.

Core.Scripting-002

Field Value
Severity High
Category Security
Location ForbiddenTypeAnalyzer.cs:70
Status Resolved

Description: The syntax walker only inspects four node kinds: ObjectCreationExpressionSyntax, InvocationExpressionSyntax with a member-access target, MemberAccessExpressionSyntax, and bare IdentifierNameSyntax. It never visits TypeOfExpressionSyntax, generic type-argument lists (GenericNameSyntax / TypeArgumentListSyntax), cast expressions (CastExpressionSyntax), is/as type patterns, default(T) expressions, array-creation element types, or using/local declared types. A script such as typeof(System.IO.File), new System.Collections.Generic.List<System.IO.FileInfo>(), (System.IO.Stream)null, or default(System.Reflection.Assembly) references a forbidden type without ever producing a node the walker examines, so the forbidden-type check is bypassed. The Phase 7 plan A.6 explicitly calls out typeof as a sandbox-escape attempt that "must fail at compile" — it currently does not.

Recommendation: Walk every TypeSyntax node (handle TypeOfExpressionSyntax, CastExpressionSyntax, generic argument lists, and the type operand of IsPatternExpression / binary as). The simplest robust fix is to enumerate all DescendantNodes() and, for any node, resolve both GetSymbolInfo and GetTypeInfo, then check the resolved type plus every type argument. Add tests covering typeof, generic arguments, casts, and default(T) with forbidden types.

Resolution: Resolved 2026-05-22 — ForbiddenTypeAnalyzer.Analyze now runs a second pass that resolves GetTypeInfo on every TypeSyntax node and recursively unwraps array element types and generic type arguments, so forbidden types named via typeof, generic arguments (List<FileInfo>), casts, is/as patterns, default(T), array-creation element types, and explicitly-typed local declarations are all rejected at compile. The original member/call node-kind switch is kept (deliberately narrow to avoid flagging inherited members such as typeof(int).Name), and a span+type dedupe prevents duplicate rejections from the two passes. Regression tests added in ScriptSandboxTests for each node form plus over-block guards for allowed generics/typeof.

Core.Scripting-003

Field Value
Severity Medium
Category Security
Location TimedScriptEvaluator.cs:9, ScriptSandbox.cs:30
Status Resolved

Description: There is no bound on memory a script may allocate or on the number of threads/tasks a script may spawn. The class docs acknowledge unbounded memory as "a budget concern" deferred to v3, but in-process execution means a script doing new byte[int.MaxValue] repeatedly (or Enumerable.Range(0,int.MaxValue).ToList() — LINQ is allow-listed) can drive the whole server to OutOfMemoryException, an outage. The timeout does not help: the allocation can exhaust memory well before 250ms elapses, and the orphaned thread-pool thread documented in TimedScriptEvaluator keeps the allocation rooted. System.Threading.Tasks is not on the deny-list, so a script can also Task.Run an unbounded fan-out of background work that outlives the timeout entirely.

Recommendation: At minimum, document this as a known accepted risk in docs/ScriptedAlarms.md / docs/VirtualTags.md rather than only in a code comment, and add the Task/Parallel namespaces to the forbidden list (scripts are synchronous predicates — they have no legitimate need to start background tasks). For memory, gate script authoring behind an Admin permission and treat the test-harness preview as the control point, or track an explicit v3 issue for out-of-process execution. Record the decision so it is not silently lost.

Resolution: Resolved 2026-05-22 — added System.Threading.Tasks to ForbiddenNamespacePrefixes (blocking Task.Run / Parallel fan-out); documented the unbounded-memory accepted risk and the Task denial rationale in docs/VirtualTags.md (new "Known resource limits" subsection) and cross-referenced from docs/ScriptedAlarms.md.

Core.Scripting-004

Field Value
Severity Medium
Category Correctness & logic bugs
Location DependencyExtractor.cs:73
Status Resolved

Description: The walker matches tag-access calls purely by spelling — any InvocationExpressionSyntax whose member name is GetTag or SetVirtualTag is treated as a ScriptContext tag access, regardless of the receiver. A script that defines a local type with a GetTag(string) method and calls other.GetTag("X"), or calls this.GetTag(...) on a script-defined helper, has spurious dependencies harvested (or, if the literal arg is non-literal, spurious rejections raised). The XML remarks claim "as long as it's not on the ctx instance, the extractor doesn't pick it up", but the code does not check that the receiver is the ctx identifier — it accepts any member access with the matching name. The DependencyExtractorTests.Ignores_non_ctx_method_named_GetTag test passes only because the helper there is a free function (not member-access form); a member-access call to a non-ctx GetTag is untested and would be misattributed.

Recommendation: In VisitInvocationExpression, additionally require that member.Expression is an IdentifierNameSyntax with Identifier.ValueText == "ctx" (matching the ScriptGlobals<TContext>.ctx field name). Add a test for someOtherObject.GetTag("X") asserting it is ignored.

Resolution: Resolved 2026-05-22 — VisitInvocationExpression now additionally checks that member.Expression is an IdentifierNameSyntax with ValueText == "ctx" before treating the call as a dependency; test Ignores_member_access_GetTag_on_non_ctx_receiver added to DependencyExtractorTests.

Core.Scripting-005

Field Value
Severity Low
Category Correctness & logic bugs
Location DependencyExtractor.cs:97
Status Resolved

Description: A raw string literal token passed as the tag path (a raw triple-quote literal) tokenizes as SingleLineRawStringLiteralToken / MultiLineRawStringLiteralToken, not StringLiteralToken. The check literal.Token.IsKind(SyntaxKind.StringLiteralToken) therefore rejects an otherwise-static raw-string path as a non-literal "dynamic path", producing a misleading rejection message. This is an edge case (operators rarely write raw strings for tag paths) but the error text would confuse anyone who does.

Recommendation: Accept all string-literal token kinds — check literal.IsKind(SyntaxKind.StringLiteralExpression) on the expression node, or include the raw-string token kinds, so a static raw string is harvested rather than rejected.

Resolution: Resolved 2026-05-23 — HandleTagCall now checks literal.IsKind(SyntaxKind.StringLiteralExpression) on the expression node, which covers regular string literals, single-line raw strings, and multi-line raw strings uniformly. Regression tests Accepts_single_line_raw_string_literal_path and Accepts_multi_line_raw_string_literal_path added to DependencyExtractorTests.

Core.Scripting-006

Field Value
Severity Low
Category Concurrency & thread safety
Location CompiledScriptCache.cs:55
Status Resolved

Description: On a failed compile the catch block calls _cache.TryRemove(key, out _) without a value comparison. If two threads race a miss for the same bad source, both observe the same faulted Lazy and throw, and both call TryRemove(key). If a concurrent retry re-adds a new Lazy for that key between the two removals, the second unconditional TryRemove could evict the in-flight retry entry. The window is small and the consequence is only a redundant recompile, so severity is Low — but the removal should be key+value scoped for correctness.

Recommendation: Use the ConcurrentDictionary.TryRemove(KeyValuePair<,>) overload to remove only the specific faulted Lazy instance, so a concurrently re-added entry is not evicted.

Resolution: Resolved 2026-05-23 — GetOrCompile's catch block now evicts via _cache.TryRemove(new KeyValuePair<string, Lazy<…>>(key, lazy)), comparing the value reference so only the faulted Lazy is removed; a concurrent retry that re-inserted a fresh Lazy under the same key is preserved. Regression test Failed_compile_eviction_does_not_remove_a_concurrent_retry_entry added to CompiledScriptCacheTests (reflection-driven deterministic race: the faulted Lazy's factory swaps the dictionary entry to a fresh Lazy as a side effect of its throw, modelling the precise race window).

Core.Scripting-007

Field Value
Severity Medium
Category Error handling & resilience
Location TimedScriptEvaluator.cs:60
Status Resolved

Description: RunAsync wraps the inner run in Task.Run(...) and then awaits WaitAsync(Timeout, ct). If the caller-supplied ct cancels at roughly the same time the timeout elapses, the order in which WaitAsync observes the timeout vs. the cancellation is non-deterministic, so the same shutdown can sometimes surface as ScriptTimeoutException and sometimes as OperationCanceledException. The class docs assert "the caller's cancel wins" as a hard guarantee that the virtual-tag engine shutdown path depends on to avoid misclassifying shutdown as a script fault — but the implementation does not guarantee it when both fire close together.

Recommendation: After catching TimeoutException, check ct.IsCancellationRequested and throw OperationCanceledException(ct) instead of ScriptTimeoutException when the caller's token is cancelled, so caller cancellation deterministically wins regardless of race ordering.

Resolution: Resolved 2026-05-22 — in the catch (TimeoutException) handler, ct.IsCancellationRequested is now checked and OperationCanceledException(ct) thrown before ScriptTimeoutException, so caller cancellation deterministically wins regardless of race ordering; regression test Caller_cancellation_wins_even_when_timeout_fires_first added to TimedScriptEvaluatorTests.

Core.Scripting-008

Field Value
Severity Low
Category Performance & resource management
Location CompiledScriptCache.cs:34, ScriptEvaluator.cs:34
Status Resolved

Description: CompiledScriptCache has no capacity bound (acknowledged in the class remarks) and no eviction. Each cached ScriptEvaluator holds a Roslyn ScriptRunner<T> delegate, which keeps the dynamically emitted script assembly loaded for the process lifetime — emitted assemblies in the default AssemblyLoadContext cannot be unloaded. Clear() drops the dictionary entries but does not unload the emitted assemblies; they leak. Across many config-generation publishes (each Clear() followed by recompiling every script), the process accumulates dead script assemblies. For the expected "low thousands" of scripts this is benign, but a long-running server with frequent publishes will see steady managed-memory growth that never returns.

Recommendation: Document the per-publish assembly accretion as a known limitation, or compile scripts into a collectible AssemblyLoadContext so Clear() can unload prior generations. At minimum add a note to docs/ScriptedAlarms.md so operators with high-publish-frequency deployments are aware.

Resolution: Resolved 2026-05-23 — switched the compile pipeline off the legacy CSharpScript.CreateDelegate path (which emits into the default, non-collectible AssemblyLoadContext) and onto a hand-rolled CSharpCompilationCompilation.Emit(MemoryStream)ScriptAssemblyLoadContext.LoadFromStream chain, with the new ScriptAssemblyLoadContext constructed isCollectible: true. Each compiled script lives in its own ALC; ScriptEvaluator now implements IDisposable and calls AssemblyLoadContext.Unload() on dispose. CompiledScriptCache.Clear() disposes every materialised evaluator before dropping its dictionary entry, and CompiledScriptCache itself is now IDisposable for graceful server shutdown. After a publish-replace cycle the prior generation's emitted assemblies become eligible for GC; the reclaim is GC-timing-sensitive (Unload is eligible-for-collection, not synchronous) and the next collection cycle reclaims them. The references list is now BCL-wide (System.* + netstandard + Microsoft.Win32.Registry via the TRUSTED_PLATFORM_ASSEMBLIES set) so forbidden BCL types resolve at compile and ForbiddenTypeAnalyzer is the sole security gate (consistent with the Core.Scripting-001 / -002 model). docs/VirtualTags.md "Compile cache" section rewritten; docs/ScriptedAlarms.md cross-reference updated to drop the obsolete restart guidance. Regression tests added in CompiledScriptCacheTests: Dispose_unloads_compiled_script_assembly_load_context, Clear_disposes_every_materialised_evaluator, and GetOrCompile_after_Dispose_throws_ObjectDisposedException; the first two prove ALC unload via WeakReference + bounded GC.Collect() loops. Suite now 104 green (was 101). Authoring convention: the synthesized wrapper is an ordinary C# static method, so scripts must end with explicit return …; per ordinary C# rules (the legacy CSharpScript "last expression yields result" shorthand no longer applies); every script in the existing corpus already uses explicit return so this is a doc-only change for new authors.

Core.Scripting-009

Field Value
Severity Low
Category Design-document adherence
Location ForbiddenTypeAnalyzer.cs:45
Status Resolved

Description: The Phase 7 plan decision #6 (docs/v2/implementation/phase-7-scripting-and-alarming.md) enumerates the forbidden surface as "No HttpClient / File / Process / reflection". ForbiddenTypeAnalyzer actually denies a broader set — System.Threading.Thread, System.Runtime.InteropServices, and Microsoft.Win32 (registry) — which is sensible hardening but is undocumented in the plan and in docs/ScriptedAlarms.md (which defers sandbox rules to VirtualTags.md). An operator reading the design docs cannot predict that a registry or interop reference will be rejected. Conversely the plan does not record the System.Environment / System.Diagnostics decisions. The code and the design document have drifted.

Recommendation: Update the plan's decision #6 (or docs/VirtualTags.md) to list the authoritative deny-list exactly as ForbiddenTypeAnalyzer.ForbiddenNamespacePrefixes defines it, including the System.Environment allowed-compromise, so the docs match the code.

Resolution: Resolved 2026-05-23 — docs/v2/implementation/phase-7-scripting-and-alarming.md decision #6 row + the "Sandbox escape" compliance-check row now enumerate the authoritative deny-list exactly as ForbiddenTypeAnalyzer defines it (namespace-prefix denies for System.IO / System.Net / System.Diagnostics / System.Reflection / System.Threading.Tasks / System.Runtime.InteropServices / Microsoft.Win32; type-granular denies for System.Environment / System.AppDomain / System.GC / System.Activator / System.Threading.Thread), and the compliance-check row lists the syntactic vectors (typeof / generic arg / cast / is/as / default(T) / array element / declared local) the broadened analyzer covers. docs/VirtualTags.md already documents the same list and is unchanged.

Core.Scripting-010

Field Value
Severity Medium
Category Testing coverage
Location tests/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting.Tests/ScriptSandboxTests.cs:54
Status Resolved

Description: The sandbox-escape test suite covers only the four obvious vectors (File / Http / Process / Reflection) as direct member-access calls. It does not test: typeof(forbidden), generic type arguments (List<FileInfo>), cast expressions to forbidden types, System.Environment.Exit / FailFast, System.Threading.Thread, System.Runtime.InteropServices, Microsoft.Win32 registry access, Activator, or System.AppDomain. Given that the analyzer is the sole security boundary for in-process untrusted-script execution, the gaps in Core.Scripting-001 and Core.Scripting-002 went undetected precisely because no test exercises those forms. The Phase 7 plan A.6 mandated "sandbox escape tests" but the implemented set is materially narrower than the threat surface.

Recommendation: Add a parameterised escape-test covering every node form in Core.Scripting-002 and every forbidden namespace/member in Core.Scripting-001. Each must assert a ScriptSandboxViolationException (or CompilationErrorException) at compile.

Resolution: Resolved 2026-05-22 — added ScriptSandboxTests cases for System.Threading.Thread, System.Threading.Tasks.Task.Run, System.Runtime.InteropServices.Marshal, and Microsoft.Win32.Registry (the four namespace-deny-list vectors that had no test); the 001/002 vectors (Environment.Exit/FailFast/AppDomain/GC/Activator, typeof, generics, cast, default(T), is/as, array element, declared variable) were already covered by the -001/-002 resolution commits. All 79 tests pass.

Core.Scripting-011

Field Value
Severity Low
Category Testing coverage
Location tests/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting.Tests/
Status Resolved

Description: Two source files have no direct test coverage: ScriptContext (Deadband static helper is exercised only indirectly through ScriptSandboxTests, and not for its boundary tolerance behaviour) and ScriptSandbox.Build itself (the ArgumentNullException / ArgumentException guards on contextType at ScriptSandbox.cs:45-48 are never asserted). ScriptLogCompanionSink and ScriptLoggerFactory have tests, but there is no test that a script's ctx.Logger Error emission surfaces via the companion sink end-to-end (factory + sink integration is untested). These are minor gaps but leave guard clauses and the logging integration unverified.

Recommendation: Add unit tests for ScriptSandbox.Build argument validation, for ScriptContext.Deadband at and around the tolerance boundary, and an end-to-end test that a script logging at Error level produces both a scripts-*.log event and a companion Warning event.

Resolution: Resolved 2026-05-23 — added three new test files: ScriptSandboxBuildTests covers the Build null / non-ScriptContext / base-class / concrete-subclass paths; ScriptContextTests locks Deadband boundary semantics (equal-to-tolerance returns false; just-over returns true; symmetric in direction; zero-tolerance returns true only on non-equal; negative tolerance trips on any non-equal); the new Factory_plus_companion_sink_integration_surfaces_script_error_in_both_logs test in ScriptLogCompanionSinkTests wires ScriptLoggerFactory + the companion sink together end-to-end and asserts an Error emission lands in both the scripts sink (at Error) and the main sink (at Warning), each tagged with ScriptName. Suite now 101 green (was 85 before).