Files
lmxopcua/code-reviews/Core.Scripting/findings.md
Joseph Doherty 0a20de728d fix(core-scripting): resolve Low code-review findings (Core.Scripting-005,006,008,009,011)
- Core.Scripting-005: DependencyExtractor.HandleTagCall now recognises
  raw-string literal paths by checking the StringLiteralExpression node
  kind instead of the legacy StringLiteralToken kind.
- Core.Scripting-006: scope CompiledScriptCache failed-compile eviction
  with TryRemove(KeyValuePair) so a racing retry entry is not evicted.
- Core.Scripting-008: document the per-publish assembly accretion as an
  accepted limitation in docs/VirtualTags.md.
- Core.Scripting-009: enumerate the authoritative deny-list (namespace
  prefixes + type-granular denies) in the Phase 7 decision-#6 entry to
  match ForbiddenTypeAnalyzer.
- Core.Scripting-011: pin ScriptSandbox.Build, ScriptContext.Deadband
  boundary semantics, and end-to-end factory + companion-sink
  integration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 07:23:42 -04:00

339 lines
21 KiB
Markdown

# Code Review — Core.Scripting
| Field | Value |
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core.Scripting-004, Core.Scripting-005 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Core.Scripting-006 |
| 4 | Error handling & resilience | Core.Scripting-007 |
| 5 | Security | Core.Scripting-001, Core.Scripting-002, Core.Scripting-003 |
| 6 | Performance & resource management | Core.Scripting-008 |
| 7 | Design-document adherence | Core.Scripting-009 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Core.Scripting-010, Core.Scripting-011 |
| 10 | Documentation & comments | No issues found |
## Findings
### Core.Scripting-001
| Field | Value |
|---|---|
| Severity | Critical |
| Category | Security |
| Location | `ForbiddenTypeAnalyzer.cs:45`, `ScriptSandbox.cs:54` |
| Status | Resolved |
**Description:** `System.Environment` lives in the allowed `System` namespace (it is in
`System.Private.CoreLib`, which is allow-listed for primitives) and is not on the
forbidden-namespace deny-list. Nothing prevents an operator-authored script from calling
`System.Environment.Exit(0)` or `System.Environment.FailFast("...")`. Both terminate the
host process immediately. Because scripted-alarm predicates and virtual-tag scripts run
in-process in the main OPC UA server (decision: "Scripting engine runs in the main .NET 10
server process"), a single malicious or buggy predicate brings down the entire server —
an outage affecting every connected client and every driver. `ScriptSandboxTests` only
pins the *read* path (`Environment.GetEnvironmentVariable`) as an accepted compromise; the
process-killing members are not considered. The whole-process kill far exceeds the
"read-only process state" justification the test comments rely on.
**Recommendation:** The forbidden surface must be member-granular, not namespace-granular,
for types in allowed namespaces. Add an explicit forbidden-member deny-list to
`ForbiddenTypeAnalyzer` covering at minimum `System.Environment.Exit`,
`System.Environment.FailFast`, `System.AppDomain`, `System.GC` (e.g. `GC.Collect`,
`GC.AddMemoryPressure`), and `System.Activator.CreateInstance` (a reflection-equivalent
escape). Reject these in `CheckSymbol` by resolved method symbol, with a test for each.
**Resolution:** Resolved 2026-05-22 — added a type-granular `ForbiddenFullTypeNames`
deny-list (`System.Environment`, `System.AppDomain`, `System.GC`, `System.Activator`) to
`ForbiddenTypeAnalyzer`; `CheckSymbol` now rejects any resolved type symbol whose
fully-qualified name matches, alongside the existing namespace-prefix check, so dangerous
`System`-namespace process-control types are blocked at compile while legitimate `System`
types (Math, String, …) stay usable. Regression tests added in `ScriptSandboxTests` for
`Environment.Exit` / `Environment.FailFast` / `Environment.GetEnvironmentVariable` /
`AppDomain` / `GC.Collect` / `Activator.CreateInstance`.
### Core.Scripting-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `ForbiddenTypeAnalyzer.cs:70` |
| Status | Resolved |
**Description:** The syntax walker only inspects four node kinds:
`ObjectCreationExpressionSyntax`, `InvocationExpressionSyntax` with a member-access target,
`MemberAccessExpressionSyntax`, and bare `IdentifierNameSyntax`. It never visits
`TypeOfExpressionSyntax`, generic type-argument lists (`GenericNameSyntax` /
`TypeArgumentListSyntax`), cast expressions (`CastExpressionSyntax`), `is`/`as` type
patterns, `default(T)` expressions, array-creation element types, or `using`/local
declared types. A script such as `typeof(System.IO.File)`,
`new System.Collections.Generic.List<System.IO.FileInfo>()`,
`(System.IO.Stream)null`, or `default(System.Reflection.Assembly)` references a forbidden
type without ever producing a node the walker examines, so the forbidden-type check is
bypassed. The Phase 7 plan A.6 explicitly calls out `typeof` as a sandbox-escape attempt
that "must fail at compile" — it currently does not.
**Recommendation:** Walk every `TypeSyntax` node (handle `TypeOfExpressionSyntax`,
`CastExpressionSyntax`, generic argument lists, and the type operand of
`IsPatternExpression` / binary `as`). The simplest robust fix is to enumerate all
`DescendantNodes()` and, for any node, resolve both `GetSymbolInfo` and `GetTypeInfo`,
then check the resolved type plus every type argument. Add tests covering `typeof`,
generic arguments, casts, and `default(T)` with forbidden types.
**Resolution:** Resolved 2026-05-22 — `ForbiddenTypeAnalyzer.Analyze` now runs a second
pass that resolves `GetTypeInfo` on every `TypeSyntax` node and recursively unwraps array
element types and generic type arguments, so forbidden types named via `typeof`, generic
arguments (`List<FileInfo>`), casts, `is`/`as` patterns, `default(T)`, array-creation
element types, and explicitly-typed local declarations are all rejected at compile. The
original member/call node-kind switch is kept (deliberately narrow to avoid flagging
inherited members such as `typeof(int).Name`), and a span+type dedupe prevents duplicate
rejections from the two passes. Regression tests added in `ScriptSandboxTests` for each
node form plus over-block guards for allowed generics/`typeof`.
### Core.Scripting-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `TimedScriptEvaluator.cs:9`, `ScriptSandbox.cs:30` |
| Status | Resolved |
**Description:** There is no bound on memory a script may allocate or on the number of
threads/tasks a script may spawn. The class docs acknowledge unbounded memory as "a budget
concern" deferred to v3, but in-process execution means a script doing
`new byte[int.MaxValue]` repeatedly (or `Enumerable.Range(0,int.MaxValue).ToList()` — LINQ
is allow-listed) can drive the whole server to `OutOfMemoryException`, an outage. The
timeout does not help: the allocation can exhaust memory well before 250ms elapses, and
the orphaned thread-pool thread documented in `TimedScriptEvaluator` keeps the allocation
rooted. `System.Threading.Tasks` is not on the deny-list, so a script can also
`Task.Run` an unbounded fan-out of background work that outlives the timeout entirely.
**Recommendation:** At minimum, document this as a known accepted risk in
`docs/ScriptedAlarms.md` / `docs/VirtualTags.md` rather than only in a code comment, and
add the `Task`/`Parallel` namespaces to the forbidden list (scripts are synchronous
predicates — they have no legitimate need to start background tasks). For memory, gate
script authoring behind an Admin permission and treat the test-harness preview as the
control point, or track an explicit v3 issue for out-of-process execution. Record the
decision so it is not silently lost.
**Resolution:** Resolved 2026-05-22 — added `System.Threading.Tasks` to `ForbiddenNamespacePrefixes` (blocking `Task.Run` / `Parallel` fan-out); documented the unbounded-memory accepted risk and the `Task` denial rationale in `docs/VirtualTags.md` (new "Known resource limits" subsection) and cross-referenced from `docs/ScriptedAlarms.md`.
### Core.Scripting-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `DependencyExtractor.cs:73` |
| Status | Resolved |
**Description:** The walker matches tag-access calls purely by spelling — any
`InvocationExpressionSyntax` whose member name is `GetTag` or `SetVirtualTag` is treated as
a `ScriptContext` tag access, regardless of the receiver. A script that defines a local
type with a `GetTag(string)` method and calls `other.GetTag("X")`, or calls
`this.GetTag(...)` on a script-defined helper, has spurious dependencies harvested (or, if
the literal arg is non-literal, spurious rejections raised). The XML remarks claim "as long
as it's not on the ctx instance, the extractor doesn't pick it up", but the code does not
check that the receiver is the `ctx` identifier — it accepts any member access with the
matching name. The `DependencyExtractorTests.Ignores_non_ctx_method_named_GetTag` test
passes only because the helper there is a *free* function (not member-access form); a
member-access call to a non-ctx `GetTag` is untested and would be misattributed.
**Recommendation:** In `VisitInvocationExpression`, additionally require that
`member.Expression` is an `IdentifierNameSyntax` with `Identifier.ValueText == "ctx"`
(matching the `ScriptGlobals<TContext>.ctx` field name). Add a test for
`someOtherObject.GetTag("X")` asserting it is ignored.
**Resolution:** Resolved 2026-05-22 — `VisitInvocationExpression` now additionally checks that `member.Expression` is an `IdentifierNameSyntax` with `ValueText == "ctx"` before treating the call as a dependency; test `Ignores_member_access_GetTag_on_non_ctx_receiver` added to `DependencyExtractorTests`.
### Core.Scripting-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `DependencyExtractor.cs:97` |
| Status | Resolved |
**Description:** A raw string literal token passed as the tag path (a raw triple-quote
literal) tokenizes as `SingleLineRawStringLiteralToken` /
`MultiLineRawStringLiteralToken`, not `StringLiteralToken`. The check
`literal.Token.IsKind(SyntaxKind.StringLiteralToken)` therefore rejects an
otherwise-static raw-string path as a non-literal "dynamic path", producing a misleading
rejection message. This is an edge case (operators rarely write raw strings for tag
paths) but the error text would confuse anyone who does.
**Recommendation:** Accept all string-literal token kinds — check
`literal.IsKind(SyntaxKind.StringLiteralExpression)` on the expression node, or include
the raw-string token kinds, so a static raw string is harvested rather than rejected.
**Resolution:** Resolved 2026-05-23 — `HandleTagCall` now checks `literal.IsKind(SyntaxKind.StringLiteralExpression)` on the expression node, which covers regular string literals, single-line raw strings, and multi-line raw strings uniformly. Regression tests `Accepts_single_line_raw_string_literal_path` and `Accepts_multi_line_raw_string_literal_path` added to `DependencyExtractorTests`.
### Core.Scripting-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `CompiledScriptCache.cs:55` |
| Status | Resolved |
**Description:** On a failed compile the `catch` block calls
`_cache.TryRemove(key, out _)` without a value comparison. If two threads race a miss for
the same bad source, both observe the same faulted `Lazy` and throw, and both call
`TryRemove(key)`. If a concurrent retry re-adds a new `Lazy` for that key between the two
removals, the second unconditional `TryRemove` could evict the in-flight retry entry. The
window is small and the consequence is only a redundant recompile, so severity is Low —
but the removal should be key+value scoped for correctness.
**Recommendation:** Use the `ConcurrentDictionary.TryRemove(KeyValuePair<,>)` overload to
remove only the specific faulted `Lazy` instance, so a concurrently re-added entry is not
evicted.
**Resolution:** Resolved 2026-05-23 — `GetOrCompile`'s catch block now evicts via `_cache.TryRemove(new KeyValuePair<string, Lazy<…>>(key, lazy))`, comparing the value reference so only the faulted Lazy is removed; a concurrent retry that re-inserted a fresh Lazy under the same key is preserved. Regression test `Failed_compile_eviction_does_not_remove_a_concurrent_retry_entry` added to `CompiledScriptCacheTests` (reflection-driven deterministic race: the faulted Lazy's factory swaps the dictionary entry to a fresh Lazy as a side effect of its throw, modelling the precise race window).
### Core.Scripting-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `TimedScriptEvaluator.cs:60` |
| Status | Resolved |
**Description:** `RunAsync` wraps the inner run in `Task.Run(...)` and then awaits
`WaitAsync(Timeout, ct)`. If the caller-supplied `ct` cancels at roughly the same time the
timeout elapses, the order in which `WaitAsync` observes the timeout vs. the cancellation
is non-deterministic, so the same shutdown can sometimes surface as
`ScriptTimeoutException` and sometimes as `OperationCanceledException`. The class docs
assert "the caller's cancel wins" as a hard guarantee that the virtual-tag engine shutdown
path depends on to avoid misclassifying shutdown as a script fault — but the
implementation does not guarantee it when both fire close together.
**Recommendation:** After catching `TimeoutException`, check `ct.IsCancellationRequested`
and throw `OperationCanceledException(ct)` instead of `ScriptTimeoutException` when the
caller's token is cancelled, so caller cancellation deterministically wins regardless of
race ordering.
**Resolution:** Resolved 2026-05-22 — in the `catch (TimeoutException)` handler, `ct.IsCancellationRequested` is now checked and `OperationCanceledException(ct)` thrown before `ScriptTimeoutException`, so caller cancellation deterministically wins regardless of race ordering; regression test `Caller_cancellation_wins_even_when_timeout_fires_first` added to `TimedScriptEvaluatorTests`.
### Core.Scripting-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `CompiledScriptCache.cs:34`, `ScriptEvaluator.cs:34` |
| Status | Open |
**Description:** `CompiledScriptCache` has no capacity bound (acknowledged in the class
remarks) and no eviction. Each cached `ScriptEvaluator` holds a Roslyn `ScriptRunner<T>`
delegate, which keeps the dynamically emitted script assembly loaded for the process
lifetime — emitted assemblies in the default `AssemblyLoadContext` cannot be unloaded.
`Clear()` drops the dictionary entries but does **not** unload the emitted assemblies;
they leak. Across many config-generation publishes (each `Clear()` followed by recompiling
every script), the process accumulates dead script assemblies. For the expected "low
thousands" of scripts this is benign, but a long-running server with frequent publishes
will see steady managed-memory growth that never returns.
**Recommendation:** Document the per-publish assembly accretion as a known limitation, or
compile scripts into a collectible `AssemblyLoadContext` so `Clear()` can unload prior
generations. At minimum add a note to `docs/ScriptedAlarms.md` so operators with
high-publish-frequency deployments are aware.
**Resolution:** Resolved 2026-05-23 — accepted as a documented known limitation rather than fixing in code (collectible `AssemblyLoadContext` for Roslyn-emitted assemblies is a v3 concern). The "Compile cache" section of `docs/VirtualTags.md` now carries a "Per-publish assembly accretion (accepted limitation, Core.Scripting-008)" note that operators with high-publish-frequency deployments can scan, and `docs/ScriptedAlarms.md` cross-references it. The accretion is benign at the expected "low thousands" of scripts scale; recommended mitigation is a scheduled server restart for deployments that publish very frequently.
### Core.Scripting-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `ForbiddenTypeAnalyzer.cs:45` |
| Status | Resolved |
**Description:** The Phase 7 plan decision #6
(`docs/v2/implementation/phase-7-scripting-and-alarming.md`) enumerates the forbidden
surface as "No HttpClient / File / Process / reflection". `ForbiddenTypeAnalyzer` actually
denies a broader set — `System.Threading.Thread`, `System.Runtime.InteropServices`, and
`Microsoft.Win32` (registry) — which is sensible hardening but is undocumented in the plan
and in `docs/ScriptedAlarms.md` (which defers sandbox rules to `VirtualTags.md`). An
operator reading the design docs cannot predict that a registry or interop reference will
be rejected. Conversely the plan does not record the `System.Environment` /
`System.Diagnostics` decisions. The code and the design document have drifted.
**Recommendation:** Update the plan's decision #6 (or `docs/VirtualTags.md`) to list the
authoritative deny-list exactly as `ForbiddenTypeAnalyzer.ForbiddenNamespacePrefixes`
defines it, including the `System.Environment` allowed-compromise, so the docs match the
code.
**Resolution:** Resolved 2026-05-23 — `docs/v2/implementation/phase-7-scripting-and-alarming.md` decision #6 row + the "Sandbox escape" compliance-check row now enumerate the authoritative deny-list exactly as `ForbiddenTypeAnalyzer` defines it (namespace-prefix denies for `System.IO` / `System.Net` / `System.Diagnostics` / `System.Reflection` / `System.Threading.Tasks` / `System.Runtime.InteropServices` / `Microsoft.Win32`; type-granular denies for `System.Environment` / `System.AppDomain` / `System.GC` / `System.Activator` / `System.Threading.Thread`), and the compliance-check row lists the syntactic vectors (`typeof` / generic arg / cast / `is`/`as` / `default(T)` / array element / declared local) the broadened analyzer covers. `docs/VirtualTags.md` already documents the same list and is unchanged.
### Core.Scripting-010
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting.Tests/ScriptSandboxTests.cs:54` |
| Status | Resolved |
**Description:** The sandbox-escape test suite covers only the four obvious vectors
(File / Http / Process / Reflection) as direct member-access calls. It does not test:
`typeof(forbidden)`, generic type arguments (`List<FileInfo>`), cast expressions to
forbidden types, `System.Environment.Exit` / `FailFast`, `System.Threading.Thread`,
`System.Runtime.InteropServices`, `Microsoft.Win32` registry access, `Activator`, or
`System.AppDomain`. Given that the analyzer is the sole security boundary for in-process
untrusted-script execution, the gaps in Core.Scripting-001 and Core.Scripting-002 went
undetected precisely because no test exercises those forms. The Phase 7 plan A.6 mandated
"sandbox escape tests" but the implemented set is materially narrower than the threat
surface.
**Recommendation:** Add a parameterised escape-test covering every node form in
Core.Scripting-002 and every forbidden namespace/member in Core.Scripting-001. Each must
assert a `ScriptSandboxViolationException` (or `CompilationErrorException`) at compile.
**Resolution:** Resolved 2026-05-22 — added `ScriptSandboxTests` cases for `System.Threading.Thread`, `System.Threading.Tasks.Task.Run`, `System.Runtime.InteropServices.Marshal`, and `Microsoft.Win32.Registry` (the four namespace-deny-list vectors that had no test); the 001/002 vectors (Environment.Exit/FailFast/AppDomain/GC/Activator, typeof, generics, cast, default(T), is/as, array element, declared variable) were already covered by the -001/-002 resolution commits. All 79 tests pass.
### Core.Scripting-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting.Tests/` |
| Status | Resolved |
**Description:** Two source files have no direct test coverage: `ScriptContext`
(`Deadband` static helper is exercised only indirectly through `ScriptSandboxTests`, and
not for its boundary `tolerance` behaviour) and `ScriptSandbox.Build` itself (the
`ArgumentNullException` / `ArgumentException` guards on `contextType` at
`ScriptSandbox.cs:45-48` are never asserted). `ScriptLogCompanionSink` and
`ScriptLoggerFactory` have tests, but there is no test that a script's `ctx.Logger` Error
emission surfaces via the companion sink end-to-end (factory + sink integration is
untested). These are minor gaps but leave guard clauses and the logging integration
unverified.
**Recommendation:** Add unit tests for `ScriptSandbox.Build` argument validation, for
`ScriptContext.Deadband` at and around the tolerance boundary, and an end-to-end test that
a script logging at Error level produces both a `scripts-*.log` event and a companion
Warning event.
**Resolution:** Resolved 2026-05-23 — added three new test files: `ScriptSandboxBuildTests` covers the `Build` null / non-`ScriptContext` / base-class / concrete-subclass paths; `ScriptContextTests` locks `Deadband` boundary semantics (equal-to-tolerance returns false; just-over returns true; symmetric in direction; zero-tolerance returns true only on non-equal; negative tolerance trips on any non-equal); the new `Factory_plus_companion_sink_integration_surfaces_script_error_in_both_logs` test in `ScriptLogCompanionSinkTests` wires `ScriptLoggerFactory` + the companion sink together end-to-end and asserts an Error emission lands in both the scripts sink (at Error) and the main sink (at Warning), each tagged with `ScriptName`. Suite now 101 green (was 85 before).