docs(code-reviews): comprehensive per-module review pass at 76d35d1

Reviewed all 31 src/ production projects against the 10-category checklist in REVIEW-PROCESS.md. Each module gets its own findings.md; code-reviews/README.md is regenerated from them. 334 findings: 6 Critical, 46 High, 126 Medium, 156 Low. Critical findings: - Server-001: WriteNodeIdUnknown recurses unconditionally — a HistoryRead on an unresolvable node crashes the process (remote DoS). - Admin-001/002: app-wide auth bypass (RouteView not AuthorizeRouteView) plus unauthenticated mutating routes. - Core.Scripting-001: System.Environment reachable from operator scripts; Environment.Exit() terminates the server. - Core.AlarmHistorian-001: rowIds/events parallel-list desync on a corrupt payload misapplies outcomes — silent alarm-event data loss. - Driver.Galaxy-001: ReconnectSupervisor is built but never triggered, so a transient gateway drop permanently kills the event stream. All findings are Status=Open; resolution is tracked per REVIEW-PROCESS.md section 4. Review only — no source code changed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 05:20:27 -04:00
parent 76d35d1b9f
commit 8568f5cd85
32 changed files with 8134 additions and 2 deletions
@@ -0,0 +1,323 @@
+# Code Review — Core.Scripting
+
+| Field | Value |
+|---|---|
+| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting` |
+| Reviewer | Claude Code |
+| Review date | 2026-05-22 |
+| Commit reviewed | `76d35d1` |
+| Status | Reviewed |
+| Open findings | 11 |
+
+## Checklist coverage
+
+A comprehensive review completes every category, recording "No issues found" where
+a category produced nothing rather than leaving it blank.
+
+| # | Category | Result |
+|---|---|---|
+| 1 | Correctness & logic bugs | Core.Scripting-004, Core.Scripting-005 |
+| 2 | OtOpcUa conventions | No issues found |
+| 3 | Concurrency & thread safety | Core.Scripting-006 |
+| 4 | Error handling & resilience | Core.Scripting-007 |
+| 5 | Security | Core.Scripting-001, Core.Scripting-002, Core.Scripting-003 |
+| 6 | Performance & resource management | Core.Scripting-008 |
+| 7 | Design-document adherence | Core.Scripting-009 |
+| 8 | Code organization & conventions | No issues found |
+| 9 | Testing coverage | Core.Scripting-010, Core.Scripting-011 |
+| 10 | Documentation & comments | No issues found |
+
+## Findings
+
+### Core.Scripting-001
+
+| Field | Value |
+|---|---|
+| Severity | Critical |
+| Category | Security |
+| Location | `ForbiddenTypeAnalyzer.cs:45`, `ScriptSandbox.cs:54` |
+| Status | Open |
+
+**Description:** `System.Environment` lives in the allowed `System` namespace (it is in
+`System.Private.CoreLib`, which is allow-listed for primitives) and is not on the
+forbidden-namespace deny-list. Nothing prevents an operator-authored script from calling
+`System.Environment.Exit(0)` or `System.Environment.FailFast("...")`. Both terminate the
+host process immediately. Because scripted-alarm predicates and virtual-tag scripts run
+in-process in the main OPC UA server (decision: "Scripting engine runs in the main .NET 10
+server process"), a single malicious or buggy predicate brings down the entire server —
+an outage affecting every connected client and every driver. `ScriptSandboxTests` only
+pins the *read* path (`Environment.GetEnvironmentVariable`) as an accepted compromise; the
+process-killing members are not considered. The whole-process kill far exceeds the
+"read-only process state" justification the test comments rely on.
+
+**Recommendation:** The forbidden surface must be member-granular, not namespace-granular,
+for types in allowed namespaces. Add an explicit forbidden-member deny-list to
+`ForbiddenTypeAnalyzer` covering at minimum `System.Environment.Exit`,
+`System.Environment.FailFast`, `System.AppDomain`, `System.GC` (e.g. `GC.Collect`,
+`GC.AddMemoryPressure`), and `System.Activator.CreateInstance` (a reflection-equivalent
+escape). Reject these in `CheckSymbol` by resolved method symbol, with a test for each.
+
+**Resolution:** _(open)_
+
+### Core.Scripting-002
+
+| Field | Value |
+|---|---|
+| Severity | High |
+| Category | Security |
+| Location | `ForbiddenTypeAnalyzer.cs:70` |
+| Status | Open |
+
+**Description:** The syntax walker only inspects four node kinds:
+`ObjectCreationExpressionSyntax`, `InvocationExpressionSyntax` with a member-access target,
+`MemberAccessExpressionSyntax`, and bare `IdentifierNameSyntax`. It never visits
+`TypeOfExpressionSyntax`, generic type-argument lists (`GenericNameSyntax` /
+`TypeArgumentListSyntax`), cast expressions (`CastExpressionSyntax`), `is`/`as` type
+patterns, `default(T)` expressions, array-creation element types, or `using`/local
+declared types. A script such as `typeof(System.IO.File)`,
+`new System.Collections.Generic.List<System.IO.FileInfo>()`,
+`(System.IO.Stream)null`, or `default(System.Reflection.Assembly)` references a forbidden
+type without ever producing a node the walker examines, so the forbidden-type check is
+bypassed. The Phase 7 plan A.6 explicitly calls out `typeof` as a sandbox-escape attempt
+that "must fail at compile" — it currently does not.
+
+**Recommendation:** Walk every `TypeSyntax` node (handle `TypeOfExpressionSyntax`,
+`CastExpressionSyntax`, generic argument lists, and the type operand of
+`IsPatternExpression` / binary `as`). The simplest robust fix is to enumerate all
+`DescendantNodes()` and, for any node, resolve both `GetSymbolInfo` and `GetTypeInfo`,
+then check the resolved type plus every type argument. Add tests covering `typeof`,
+generic arguments, casts, and `default(T)` with forbidden types.
+
+**Resolution:** _(open)_
+
+### Core.Scripting-003
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Security |
+| Location | `TimedScriptEvaluator.cs:9`, `ScriptSandbox.cs:30` |
+| Status | Open |
+
+**Description:** There is no bound on memory a script may allocate or on the number of
+threads/tasks a script may spawn. The class docs acknowledge unbounded memory as "a budget
+concern" deferred to v3, but in-process execution means a script doing
+`new byte[int.MaxValue]` repeatedly (or `Enumerable.Range(0,int.MaxValue).ToList()` — LINQ
+is allow-listed) can drive the whole server to `OutOfMemoryException`, an outage. The
+timeout does not help: the allocation can exhaust memory well before 250ms elapses, and
+the orphaned thread-pool thread documented in `TimedScriptEvaluator` keeps the allocation
+rooted. `System.Threading.Tasks` is not on the deny-list, so a script can also
+`Task.Run` an unbounded fan-out of background work that outlives the timeout entirely.
+
+**Recommendation:** At minimum, document this as a known accepted risk in
+`docs/ScriptedAlarms.md` / `docs/VirtualTags.md` rather than only in a code comment, and
+add the `Task`/`Parallel` namespaces to the forbidden list (scripts are synchronous
+predicates — they have no legitimate need to start background tasks). For memory, gate
+script authoring behind an Admin permission and treat the test-harness preview as the
+control point, or track an explicit v3 issue for out-of-process execution. Record the
+decision so it is not silently lost.
+
+**Resolution:** _(open)_
+
+### Core.Scripting-004
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Location | `DependencyExtractor.cs:73` |
+| Status | Open |
+
+**Description:** The walker matches tag-access calls purely by spelling — any
+`InvocationExpressionSyntax` whose member name is `GetTag` or `SetVirtualTag` is treated as
+a `ScriptContext` tag access, regardless of the receiver. A script that defines a local
+type with a `GetTag(string)` method and calls `other.GetTag("X")`, or calls
+`this.GetTag(...)` on a script-defined helper, has spurious dependencies harvested (or, if
+the literal arg is non-literal, spurious rejections raised). The XML remarks claim "as long
+as it's not on the ctx instance, the extractor doesn't pick it up", but the code does not
+check that the receiver is the `ctx` identifier — it accepts any member access with the
+matching name. The `DependencyExtractorTests.Ignores_non_ctx_method_named_GetTag` test
+passes only because the helper there is a *free* function (not member-access form); a
+member-access call to a non-ctx `GetTag` is untested and would be misattributed.
+
+**Recommendation:** In `VisitInvocationExpression`, additionally require that
+`member.Expression` is an `IdentifierNameSyntax` with `Identifier.ValueText == "ctx"`
+(matching the `ScriptGlobals<TContext>.ctx` field name). Add a test for
+`someOtherObject.GetTag("X")` asserting it is ignored.
+
+**Resolution:** _(open)_
+
+### Core.Scripting-005
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Location | `DependencyExtractor.cs:97` |
+| Status | Open |
+
+**Description:** A raw string literal token passed as the tag path (a raw triple-quote
+literal) tokenizes as `SingleLineRawStringLiteralToken` /
+`MultiLineRawStringLiteralToken`, not `StringLiteralToken`. The check
+`literal.Token.IsKind(SyntaxKind.StringLiteralToken)` therefore rejects an
+otherwise-static raw-string path as a non-literal "dynamic path", producing a misleading
+rejection message. This is an edge case (operators rarely write raw strings for tag
+paths) but the error text would confuse anyone who does.
+
+**Recommendation:** Accept all string-literal token kinds — check
+`literal.IsKind(SyntaxKind.StringLiteralExpression)` on the expression node, or include
+the raw-string token kinds, so a static raw string is harvested rather than rejected.
+
+**Resolution:** _(open)_
+
+### Core.Scripting-006
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Location | `CompiledScriptCache.cs:55` |
+| Status | Open |
+
+**Description:** On a failed compile the `catch` block calls
+`_cache.TryRemove(key, out _)` without a value comparison. If two threads race a miss for
+the same bad source, both observe the same faulted `Lazy` and throw, and both call
+`TryRemove(key)`. If a concurrent retry re-adds a new `Lazy` for that key between the two
+removals, the second unconditional `TryRemove` could evict the in-flight retry entry. The
+window is small and the consequence is only a redundant recompile, so severity is Low —
+but the removal should be key+value scoped for correctness.
+
+**Recommendation:** Use the `ConcurrentDictionary.TryRemove(KeyValuePair<,>)` overload to
+remove only the specific faulted `Lazy` instance, so a concurrently re-added entry is not
+evicted.
+
+**Resolution:** _(open)_
+
+### Core.Scripting-007
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Location | `TimedScriptEvaluator.cs:60` |
+| Status | Open |
+
+**Description:** `RunAsync` wraps the inner run in `Task.Run(...)` and then awaits
+`WaitAsync(Timeout, ct)`. If the caller-supplied `ct` cancels at roughly the same time the
+timeout elapses, the order in which `WaitAsync` observes the timeout vs. the cancellation
+is non-deterministic, so the same shutdown can sometimes surface as
+`ScriptTimeoutException` and sometimes as `OperationCanceledException`. The class docs
+assert "the caller's cancel wins" as a hard guarantee that the virtual-tag engine shutdown
+path depends on to avoid misclassifying shutdown as a script fault — but the
+implementation does not guarantee it when both fire close together.
+
+**Recommendation:** After catching `TimeoutException`, check `ct.IsCancellationRequested`
+and throw `OperationCanceledException(ct)` instead of `ScriptTimeoutException` when the
+caller's token is cancelled, so caller cancellation deterministically wins regardless of
+race ordering.
+
+**Resolution:** _(open)_
+
+### Core.Scripting-008
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Performance & resource management |
+| Location | `CompiledScriptCache.cs:34`, `ScriptEvaluator.cs:34` |
+| Status | Open |
+
+**Description:** `CompiledScriptCache` has no capacity bound (acknowledged in the class
+remarks) and no eviction. Each cached `ScriptEvaluator` holds a Roslyn `ScriptRunner<T>`
+delegate, which keeps the dynamically emitted script assembly loaded for the process
+lifetime — emitted assemblies in the default `AssemblyLoadContext` cannot be unloaded.
+`Clear()` drops the dictionary entries but does **not** unload the emitted assemblies;
+they leak. Across many config-generation publishes (each `Clear()` followed by recompiling
+every script), the process accumulates dead script assemblies. For the expected "low
+thousands" of scripts this is benign, but a long-running server with frequent publishes
+will see steady managed-memory growth that never returns.
+
+**Recommendation:** Document the per-publish assembly accretion as a known limitation, or
+compile scripts into a collectible `AssemblyLoadContext` so `Clear()` can unload prior
+generations. At minimum add a note to `docs/ScriptedAlarms.md` so operators with
+high-publish-frequency deployments are aware.
+
+**Resolution:** _(open)_
+
+### Core.Scripting-009
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Design-document adherence |
+| Location | `ForbiddenTypeAnalyzer.cs:45` |
+| Status | Open |
+
+**Description:** The Phase 7 plan decision #6
+(`docs/v2/implementation/phase-7-scripting-and-alarming.md`) enumerates the forbidden
+surface as "No HttpClient / File / Process / reflection". `ForbiddenTypeAnalyzer` actually
+denies a broader set — `System.Threading.Thread`, `System.Runtime.InteropServices`, and
+`Microsoft.Win32` (registry) — which is sensible hardening but is undocumented in the plan
+and in `docs/ScriptedAlarms.md` (which defers sandbox rules to `VirtualTags.md`). An
+operator reading the design docs cannot predict that a registry or interop reference will
+be rejected. Conversely the plan does not record the `System.Environment` /
+`System.Diagnostics` decisions. The code and the design document have drifted.
+
+**Recommendation:** Update the plan's decision #6 (or `docs/VirtualTags.md`) to list the
+authoritative deny-list exactly as `ForbiddenTypeAnalyzer.ForbiddenNamespacePrefixes`
+defines it, including the `System.Environment` allowed-compromise, so the docs match the
+code.
+
+**Resolution:** _(open)_
+
+### Core.Scripting-010
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Testing coverage |
+| Location | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting.Tests/ScriptSandboxTests.cs:54` |
+| Status | Open |
+
+**Description:** The sandbox-escape test suite covers only the four obvious vectors
+(File / Http / Process / Reflection) as direct member-access calls. It does not test:
+`typeof(forbidden)`, generic type arguments (`List<FileInfo>`), cast expressions to
+forbidden types, `System.Environment.Exit` / `FailFast`, `System.Threading.Thread`,
+`System.Runtime.InteropServices`, `Microsoft.Win32` registry access, `Activator`, or
+`System.AppDomain`. Given that the analyzer is the sole security boundary for in-process
+untrusted-script execution, the gaps in Core.Scripting-001 and Core.Scripting-002 went
+undetected precisely because no test exercises those forms. The Phase 7 plan A.6 mandated
+"sandbox escape tests" but the implemented set is materially narrower than the threat
+surface.
+
+**Recommendation:** Add a parameterised escape-test covering every node form in
+Core.Scripting-002 and every forbidden namespace/member in Core.Scripting-001. Each must
+assert a `ScriptSandboxViolationException` (or `CompilationErrorException`) at compile.
+
+**Resolution:** _(open)_
+
+### Core.Scripting-011
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Testing coverage |
+| Location | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting.Tests/` |
+| Status | Open |
+
+**Description:** Two source files have no direct test coverage: `ScriptContext`
+(`Deadband` static helper is exercised only indirectly through `ScriptSandboxTests`, and
+not for its boundary `tolerance` behaviour) and `ScriptSandbox.Build` itself (the
+`ArgumentNullException` / `ArgumentException` guards on `contextType` at
+`ScriptSandbox.cs:45-48` are never asserted). `ScriptLogCompanionSink` and
+`ScriptLoggerFactory` have tests, but there is no test that a script's `ctx.Logger` Error
+emission surfaces via the companion sink end-to-end (factory + sink integration is
+untested). These are minor gaps but leave guard clauses and the logging integration
+unverified.
+
+**Recommendation:** Add unit tests for `ScriptSandbox.Build` argument validation, for
+`ScriptContext.Deadband` at and around the tolerance boundary, and an end-to-end test that
+a script logging at Error level produces both a `scripts-*.log` event and a companion
+Warning event.
+
+**Resolution:** _(open)_