docs(code-reviews): updated re-review at commit a9be809 — 12 new findings

Re-reviewed the four modules with source changes since the previous review commit 76d35d1, per REVIEW-PROCESS.md section 6. Updated each findings.md header (date 2026-05-23, commit a9be809) and appended new findings under continued numbering. Regenerated README.md.

## New findings: 12 total across 4 modules

### Core.Scripting (5 new, IDs -012 to -016)

- **-012 High Security**: broadened BCL references (System.* + netstandard) re-expose System.Threading.ThreadPool / Timer / AssemblyLoadContext, which the analyzer's deny-list doesn't cover. Re-introduces the background-work threat Core.Scripting-003 closed via System.Threading.Tasks deny.
- **-013 Medium Security**: hand-rolled wrapper-source generation lets brace-balanced user source inject sibling methods/classes alongside CompiledScript.Run. Analyzer still gates forbidden types, but the documented 'method body' authoring contract is silently relaxed.
- **-014 Medium Concurrency**: CompiledScriptCache.Clear() uses key-only TryRemove(key, out _); the same race the -006 resolution fixed in GetOrCompile's catch is latent here on publish-replace.
- **-015 Low Correctness**: ToCSharpTypeName truncates at first backtick; silently drops closed type arguments of nested-generic shapes (Outer<>.Inner<>). Latent: no production caller uses this shape today.
- **-016 Medium Performance**: VirtualTagEngine + ScriptedAlarmEngine call ScriptEvaluator.Compile directly without going through CompiledScriptCache, so the headline -008 collectible-ALC fix doesn't run on the actual production path; the per-publish leak is still in effect.

### Core.ScriptedAlarms (1 new, ID -013)

- **-013 Low Documentation**: new internal test accessors return the live mutable scratch dictionary; XML docs don't warn future test authors about the synchronisation contract.

### Driver.Cli.Common (2 new, IDs -007, -008)

- **-007 High Correctness**: 0x80550000 was added as BadDeviceFailure but the real OPC UA spec value for BadDeviceFailure is 0x808B0000 (verified against Driver.Galaxy.Runtime.StatusCodeMap and HistorianQualityMapper, both of which use the correct 0x808B0000). 0x80550000 is actually BadSecurityPolicyRejected. The native mappers (FOCAS / AbCip / AbLegacy) all use the wrong 0x80550000; this session's SnapshotFormatter extension propagated the wrong name and the test asserts against the same wrong value so CI is blind; same shape of bug as Driver.Cli.Common-001.
- **-008 Low Testing**: new FormatStatus_names_native_driver_emitted_codes Theory is redundant with the existing well-known Theory (same five InlineData rows added to both) and uses weaker ShouldContain assertion than the well-known Theory's ShouldBe.

### Driver.Galaxy (4 new, IDs -015 to -018)

- **-015 Medium Security**: vendored DLLs (libs/) have no recorded provenance: no source-commit SHA from the mxaccessgw repo, no SHA-256 checksum in libs/README.md. Tampering / accidental swap undetectable.
- **-016 Medium Performance**: version skew between declared PackageReferences (Polly 8.5.2 / Grpc.Net.Client 2.71.0 / Microsoft.Extensions.Logging.Abstractions 10.0.0) and what the vendored DLL was actually built against (Polly.Core 8.6.6 / Grpc.Net.Client 2.76.0 / Microsoft.Extensions.Logging.Abstractions 10.0.7). Latent now (assembly-version refs are loose) but precise shape that produces a runtime MissingMethodException.
- **-017 Low Design**: no contract-version handshake between the driver and the gateway; proto could evolve under the gateway without the driver noticing.
- **-018 Low Documentation**: libs/README.md points at the wrong sibling csproj as the version source-of-truth; missing SpecificVersion=false on the Reference items; missing mxaccessgw source-commit SHA.

## Particularly notable

Two findings undercut commits from this session:

- Driver.Cli.Common-007 invalidates commit 5a9c459 (which named 0x80550000 as BadDeviceFailure across the cross-CLI shortlist).
- Core.Scripting-016 invalidates the production effect of commit 7b6ab2e (the collectible-ALC fix wired Dispose only via CompiledScriptCache, which the engines don't use).

The wider native-mapper miscoding behind -007 also affects three driver modules outside this session's edit scope (FocasStatusMapper, AbCipStatusMapper, AbLegacyStatusMapper all carry the wrong code).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 17:02:47 -04:00
parent a9be80923c
commit 41e62b2663
5 changed files with 594 additions and 35 deletions
@@ -4,28 +4,33 @@
 |---|---|
 | Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting` |
 | Reviewer | Claude Code |
-| Review date | 2026-05-22 |
-| Commit reviewed | `76d35d1` |
+| Review date | 2026-05-23 |
+| Commit reviewed | `a9be809` |
 | Status | Reviewed |
-| Open findings | 0 |
+| Open findings | 5 |

 ## Checklist coverage

 A comprehensive review completes every category, recording "No issues found" where
 a category produced nothing rather than leaving it blank.

-| # | Category | Result |
-|---|---|---|
-| 1 | Correctness & logic bugs | Core.Scripting-004, Core.Scripting-005 |
-| 2 | OtOpcUa conventions | No issues found |
-| 3 | Concurrency & thread safety | Core.Scripting-006 |
-| 4 | Error handling & resilience | Core.Scripting-007 |
-| 5 | Security | Core.Scripting-001, Core.Scripting-002, Core.Scripting-003 |
-| 6 | Performance & resource management | Core.Scripting-008 |
-| 7 | Design-document adherence | Core.Scripting-009 |
-| 8 | Code organization & conventions | No issues found |
-| 9 | Testing coverage | Core.Scripting-010, Core.Scripting-011 |
-| 10 | Documentation & comments | No issues found |
+The 2026-05-23 re-review covers only code touched between commits `76d35d1` and
+`a9be809` (primarily the Core.Scripting-008 ALC rewrite + the broadened BCL
+references). Categories where the new code surface produced no issues are
+recorded as "No new issues" for that pass.
+
+| # | Category | Result (76d35d1) | Result (a9be809, new code only) |
+|---|---|---|---|
+| 1 | Correctness & logic bugs | Core.Scripting-004, Core.Scripting-005 | Core.Scripting-015 |
+| 2 | OtOpcUa conventions | No issues found | No new issues |
+| 3 | Concurrency & thread safety | Core.Scripting-006 | Core.Scripting-014 |
+| 4 | Error handling & resilience | Core.Scripting-007 | No new issues |
+| 5 | Security | Core.Scripting-001, Core.Scripting-002, Core.Scripting-003 | Core.Scripting-012, Core.Scripting-013 |
+| 6 | Performance & resource management | Core.Scripting-008 | Core.Scripting-016 |
+| 7 | Design-document adherence | Core.Scripting-009 | No new issues |
+| 8 | Code organization & conventions | No issues found | No new issues |
+| 9 | Testing coverage | Core.Scripting-010, Core.Scripting-011 | No new issues |
+| 10 | Documentation & comments | No issues found | No new issues |

 ## Findings

@@ -362,3 +367,294 @@ a script logging at Error level produces both a `scripts-*.log` event and a comp
 Warning event.

 **Resolution:** Resolved 2026-05-23 — added three new test files: `ScriptSandboxBuildTests` covers the `Build` null / non-`ScriptContext` / base-class / concrete-subclass paths; `ScriptContextTests` locks `Deadband` boundary semantics (equal-to-tolerance returns false; just-over returns true; symmetric in direction; zero-tolerance returns true only on non-equal; negative tolerance trips on any non-equal); the new `Factory_plus_companion_sink_integration_surfaces_script_error_in_both_logs` test in `ScriptLogCompanionSinkTests` wires `ScriptLoggerFactory` + the companion sink together end-to-end and asserts an Error emission lands in both the scripts sink (at Error) and the main sink (at Warning), each tagged with `ScriptName`. Suite now 101 green (was 85 before).
+
+### Core.Scripting-012
+
+| Field | Value |
+|---|---|
+| Severity | High |
+| Category | Security |
+| Location | `ForbiddenTypeAnalyzer.cs:60-76`, `ScriptSandbox.cs:96-126` |
+| Status | Open |
+
+**Description:** The Core.Scripting-008 rewrite broadened the BCL references list
+from a narrow allow-list (`System.Private.CoreLib` + `System.Linq` only) to the
+full `TRUSTED_PLATFORM_ASSEMBLIES` set filtered to `System.*` + `netstandard` +
+`Microsoft.Win32.Registry`. This change correctly delegates the security gate to
+`ForbiddenTypeAnalyzer` (the new comment in `ScriptSandbox` calls this out
+explicitly), but the analyzer's deny-list has not been expanded to match the new
+attack surface, and three categories of dangerous BCL types in the `System.*`
+allow-listed assemblies are now reachable from script source:
+
+1. **`System.Threading.ThreadPool`** (in namespace `System.Threading`). The
+   Core.Scripting-003 fix added `System.Threading.Tasks` to deny `Task.Run` /
+   `Parallel` fan-out because background work that outlives the per-evaluation
+   timeout is the explicit threat. `ThreadPool.QueueUserWorkItem`,
+   `ThreadPool.UnsafeQueueUserWorkItem`, and `ThreadPool.RegisterWaitForSingleObject`
+   are exactly the same threat — they schedule background work that outlives the
+   `WaitAsync(Timeout)` budget and tie up worker threads — but `System.Threading`
+   itself is allowed (because `CancellationToken` / `SemaphoreSlim` / `Volatile`
+   live there). The Core.Scripting-003 resolution is incomplete on the new
+   reference surface.
+2. **`System.Threading.Timer`** (same namespace). Schedules a background
+   callback; the script returns control to the engine but the timer keeps
+   firing past the evaluation budget. Same threat as `Task.Run`.
+3. **`System.Runtime.Loader.AssemblyLoadContext`** (in namespace
+   `System.Runtime.Loader`, which is not denied — only `System.Runtime.InteropServices`
+   is). The constructor + `LoadFromAssemblyPath` / `LoadFromStream` /
+   `LoadFromAssemblyName` let a script load an arbitrary DLL into the host
+   process. Pass (1) of the analyzer resolves the receiver type
+   (`AssemblyLoadContext`, allowed) + the invocation symbol's containing type
+   (also `AssemblyLoadContext`, allowed) and lets the call through. Pass (2)
+   only inspects `TypeSyntax` nodes — if the script discards the returned
+   `Assembly` (e.g. `alc.LoadFromAssemblyPath(@"C:\evil.dll");`) there is no
+   `TypeSyntax` for the analyzer to walk and the call is accepted. Triggering
+   execution of the loaded code from inside the sandbox is hard (most of
+   `Assembly`'s surface is in `System.Reflection`, which is denied) but the
+   defense-in-depth gap is real: an attacker who can author a script also
+   typically controls a file path on the server (Admin UI uploads, share
+   mounts) and loading an assembly is the prerequisite to every chained
+   escape — module initializers, type-resolve handlers, and a future analyzer
+   slip would all become exploitable.
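+
+A minimal, self-contained illustration of why vector (1) carries the same
+threat as `Task.Run` is sketched below. The `Evaluate` helper and the script
+delegate are hypothetical stand-ins, not the real `ScriptEvaluator` API; only
+`ThreadPool.QueueUserWorkItem`, `Task.WaitAsync`, and `ManualResetEventSlim`
+are real BCL members.
+
+```csharp
+using System;
+using System.Threading;
+using System.Threading.Tasks;
+
+public static class BackgroundWorkDemo
+{
+    // Hypothetical stand-in for a per-evaluation timeout wrapper.
+    public static Task<int> Evaluate(Func<int> script, TimeSpan timeout) =>
+        Task.Run(script).WaitAsync(timeout);
+
+    public static (bool finishedDuringEval, bool finishedAfterEval) Run()
+    {
+        var release = new ManualResetEventSlim(false);
+        var finished = new ManualResetEventSlim(false);
+
+        // The "script" queues a work item and returns immediately,
+        // so the evaluation completes well inside its timeout.
+        _ = Evaluate(() =>
+        {
+            ThreadPool.QueueUserWorkItem(_ =>
+            {
+                release.Wait();   // still parked after the evaluation has returned
+                finished.Set();
+            });
+            return 0;
+        }, TimeSpan.FromSeconds(5)).GetAwaiter().GetResult();
+
+        bool finishedDuringEval = finished.IsSet; // evaluation is over, work is not
+        release.Set();
+        finished.Wait();                          // completes outside any timeout budget
+        return (finishedDuringEval, finished.IsSet);
+    }
+}
+```
+
+The queued callback is not tied to the awaited task, so the timeout never
+observes it; a `Timer` callback behaves the same way in this respect.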
+
+In addition, two lower-impact `System.*` members are reachable that arguably
+shouldn't be: **`System.Console.SetOut`** / **`Console.SetError`** could
+redirect the host's console streams (this requires constructing a
+`System.IO.TextWriter`, which is blocked, so the practical exploit is limited to
+`Console.WriteLine` log-spam), and **`System.Globalization.CultureInfo.DefaultThreadCurrentCulture`**
+could perturb the entire process's formatting behavior (a subtle but real cross-script
+side effect).
+
+The original Core.Scripting-001 finding called out the model: when an allow-listed
+namespace contains dangerous types, those types must be denied type-granularly.
+The new reference surface introduces several more such types and the deny-list
+has not been kept in sync.
+
+**Recommendation:** Add `System.Threading.ThreadPool` and `System.Threading.Timer`
+to `ForbiddenFullTypeNames`. Add `System.Runtime.Loader` as a namespace prefix
+to `ForbiddenNamespacePrefixes` (every type in `System.Runtime.Loader`, including
+`AssemblyLoadContext` and `AssemblyDependencyResolver`, is out of script
+scope). Consider adding `System.Console` to `ForbiddenFullTypeNames`
+to stop log-spam through the host's console streams, and at minimum document
+`CultureInfo.DefaultThreadCurrentCulture` as an accepted cross-script side
+effect. Each addition must have a regression test in `ScriptSandboxTests`
+mirroring the Core.Scripting-010 vector style. Update
+`docs/v2/implementation/phase-7-scripting-and-alarming.md` decision #6 + the
+"Sandbox escape" compliance-check row to enumerate the additions, per the
+Core.Scripting-009 doc-sync convention.
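+
+A minimal sketch of how the two deny-list shapes could combine is given below.
+The collection names come from this finding; the `IsForbidden` helper, the
+string-based matching, and the class name are assumptions for illustration and
+do not reflect the analyzer's actual symbol-based implementation.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+public static class DenyListSketch
+{
+    // Type-granular denials for dangerous types inside an otherwise allowed namespace.
+    static readonly HashSet<string> ForbiddenFullTypeNames = new(StringComparer.Ordinal)
+    {
+        "System.Threading.ThreadPool",
+        "System.Threading.Timer",
+    };
+
+    // Whole-namespace denials.
+    static readonly string[] ForbiddenNamespacePrefixes =
+    {
+        "System.Threading.Tasks",
+        "System.Runtime.Loader",
+    };
+
+    // Hypothetical helper: takes a fully qualified type name without "global::".
+    public static bool IsForbidden(string fullTypeName)
+    {
+        if (ForbiddenFullTypeNames.Contains(fullTypeName))
+            return true;
+
+        // Match on a namespace boundary so an unrelated namespace that merely
+        // starts with the same characters is not caught by accident.
+        return ForbiddenNamespacePrefixes.Any(prefix =>
+            fullTypeName.StartsWith(prefix + ".", StringComparison.Ordinal));
+    }
+}
+```
+
+With this shape `System.Threading.SemaphoreSlim` and
+`System.Threading.CancellationToken` stay usable while `ThreadPool`, `Timer`,
+and everything under `System.Runtime.Loader` are rejected.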
+
+**Resolution:** _(empty until closed; on close, record the fixing commit SHA, the date, and a one-line description of the fix)_
+
+### Core.Scripting-013
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Security |
+| Location | `ScriptEvaluator.cs:202-225` (`BuildWrapperSource`) |
+| Status | Open |
+
+**Description:** The synthesized wrapper pastes the user's source verbatim
+between `{` and `}` braces inside a static method body, with a `#line 1`
+directive and no escaping. The legacy `CSharpScript.CreateDelegate` path was
+robust to this because Roslyn's scripting compiler parses script source as a
+top-level statement sequence; the new hand-rolled path is parsing ordinary C# in
+a method body, so a script that injects matching `{` / `}` braces can extend the
+synthesized compilation unit with additional methods, classes, or `#line`
+directives. For example, a script body of
+`return 0; } public static int Evil() { return 0; }} public static class CompiledScript2 { public static void M() {`
+ends the `Run` method early, declares a sibling `Evil` method (and even a
+sibling `CompiledScript2` class) inside the synthesized namespace, then opens an
+unclosed method that consumes the wrapper's trailing `}\n}`. With matching brace
+counts the script parses cleanly and compiles.
+
+`ForbiddenTypeAnalyzer` walks every descendant of every syntax tree, so any
+forbidden BCL types named inside the injected methods are still caught — the
+finding is **not** a direct sandbox escape. However:
+
+- It silently relaxes the operator-visible authoring contract documented in
+  `docs/VirtualTags.md` ("scripts are statement bodies that end with an
+  explicit `return …;`") to "scripts can be any compilable C# inside the
+  `CompiledScript` namespace" — operators have access to features the design
+  did not intend to expose (local types defined as siblings of `Run`, custom
+  module initializers via attributes, etc.).
+- A script can embed its own `#line` directives that override the
+  `#line 1` we emit just above the user source, producing misleading error
+  locations in compiler diagnostics surfaced to the operator.
+- Future hardening that relies on syntactic-shape assumptions (e.g.
+  "every script has exactly one method") would silently fail.
+- It widens the analyzer's surface: the analyzer's correctness now depends on
+  Pass (2) correctly walking every conceivable C# construct that can name a
+  type, including ones a normal script body would never contain
+  (`UnmanagedCallersOnly` attribute, function pointer types `delegate*<...>`,
+  pattern types, switch arm types, …).
+
+**Recommendation:** Either (a) reject scripts whose parsed body contains
+declarations other than statements — walk the wrapper's syntax tree after parse
+and require that the only members of `CompiledScript` are the single `Run`
+method, raising a `CompilationErrorException` if anything else appears — or
+(b) parse the user source independently as a `BlockSyntax` and inject the
+parsed block as the method body via the Roslyn syntax API, which makes
+brace-mismatched / class-injecting source unparseable. Add a regression test
+covering at least the brace-injection vector
+(`return 0; } public static int Evil() { return 0;`).
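+
+Option (a) could look roughly like the sketch below. It assumes the
+`Microsoft.CodeAnalysis.CSharp` package is referenced; the `Wrap` template is
+a hypothetical simplification (no namespace, fixed `object` return type) and
+not the real `BuildWrapperSource` output, and `HasExpectedShape` is an
+illustrative name.
+
+```csharp
+using System.Linq;
+using Microsoft.CodeAnalysis;
+using Microsoft.CodeAnalysis.CSharp;
+using Microsoft.CodeAnalysis.CSharp.Syntax;
+
+public static class WrapperShapeCheck
+{
+    // Hypothetical wrapper template; the real generated source may differ.
+    public static string Wrap(string userSource) =>
+        "public static class CompiledScript\n{\n" +
+        "    public static object Run()\n    {\n#line 1\n" +
+        userSource + "\n    }\n}\n";
+
+    // True only when the parsed wrapper is exactly one class, CompiledScript,
+    // whose single member is a method named Run.
+    public static bool HasExpectedShape(string wrappedSource)
+    {
+        var root = CSharpSyntaxTree.ParseText(wrappedSource).GetCompilationUnitRoot();
+
+        // Unbalanced or otherwise unparseable source is rejected outright.
+        if (root.GetDiagnostics().Any(d => d.Severity == DiagnosticSeverity.Error))
+            return false;
+
+        if (root.Members.Count != 1)
+            return false;
+        if (root.Members[0] is not ClassDeclarationSyntax cls)
+            return false;
+        if (cls.Identifier.ValueText != "CompiledScript")
+            return false;
+
+        return cls.Members.Count == 1
+            && cls.Members[0] is MethodDeclarationSyntax run
+            && run.Identifier.ValueText == "Run";
+    }
+}
+```
+
+Under this template the brace-injection vector parses without errors but
+yields two members on `CompiledScript`, so the check rejects it, while an
+ordinary `return 1 + 2;` body passes.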
+
+**Resolution:** _(empty until closed; on close, record the fixing commit SHA, the date, and a one-line description of the fix)_
+
+### Core.Scripting-014
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Concurrency & thread safety |
+| Location | `CompiledScriptCache.cs:91-103` (`Clear`) |
+| Status | Open |
+
+**Description:** `Clear()` snapshots `_cache.Keys.ToArray()` then iterates,
+calling `TryRemove(key, out var lazy)` on each — the key-only overload, not
+the value-scoped one used in `GetOrCompile`'s catch block. Between the
+snapshot and a given `TryRemove`, a concurrent `GetOrCompile(scriptSource)`
+call that hashes to the same key can re-insert a fresh `Lazy` whose `.Value`
+the caller already retained. The unconditional `TryRemove` then removes that
+fresh `Lazy` and `DisposeLazyIfMaterialised(lazy)` calls `Dispose()` on its
+evaluator — unloading the ALC while the concurrent caller still holds a
+reference to the evaluator and intends to invoke it.
+
+This is exactly the race-window pattern that the Core.Scripting-006 resolution
+fixed in `GetOrCompile`'s catch block (the test
+`Failed_compile_eviction_does_not_remove_a_concurrent_retry_entry` locks it
+there). `Clear()` carries the same shape but uses the older, value-blind
+overload, so the race that Core.Scripting-006 addressed is still latent on the
+publish-replace path.
+
+In current production wiring `Clear()` is intended for config-publish + tests
+— neither overlaps steady-state evaluation under the documented design — so
+the in-practice impact is low. But the cache is checked in as the
+forward-looking compile cache for the engines (per `Script.SourceHash`'s docs
+and the cache's own remarks); a future wiring that calls `Clear()` from
+publish while evaluations are in flight would dispose live evaluators.
+
+**Recommendation:** Replace the snapshot + `TryRemove(key, out var lazy)`
+sequence with an enumeration that captures the `Lazy` reference at snapshot
+time and uses the value-scoped `TryRemove(KeyValuePair<,>)` overload, mirroring
+the Core.Scripting-006 fix:
+
+```csharp
+foreach (var entry in _cache.ToArray())
+{
+    if (_cache.TryRemove(entry))
+        DisposeLazyIfMaterialised(entry.Value);
+}
+```
+
+Add a regression test that races `GetOrCompile` against `Clear` and asserts
+the caller's evaluator is still usable.
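+
+The difference between the two `TryRemove` overloads can be shown
+deterministically by simulating the interleaving on a single thread. This is a
+sketch only: the string-valued `Lazy` entries and the class name are
+hypothetical stand-ins for the cache's real evaluator entries, and a true
+regression test would still need to race real threads.
+
+```csharp
+using System;
+using System.Collections.Concurrent;
+using System.Collections.Generic;
+
+public static class ValueScopedRemoveDemo
+{
+    public static (bool valueScopedRemoved, bool freshSurvived, bool keyOnlyRemovedFresh) Run()
+    {
+        var cache = new ConcurrentDictionary<string, Lazy<string>>();
+        cache["k"] = new Lazy<string>(() => "old evaluator");
+
+        // Snapshot captures the (key, Lazy) pairs, as in the recommended loop.
+        KeyValuePair<string, Lazy<string>>[] snapshot = cache.ToArray();
+
+        // Simulated interleaving: another caller swaps in a fresh entry for the same key.
+        var fresh = new Lazy<string>(() => "new evaluator");
+        cache["k"] = fresh;
+
+        // Value-scoped overload: removes only if the stored value still matches the snapshot.
+        bool valueScopedRemoved = cache.TryRemove(snapshot[0]);
+        bool freshSurvived = cache.TryGetValue("k", out var current)
+                             && ReferenceEquals(current, fresh);
+
+        // Key-only overload: removes whatever is stored now, including the fresh entry.
+        bool keyOnlyRemovedFresh = cache.TryRemove("k", out var victim)
+                                   && ReferenceEquals(victim, fresh);
+
+        return (valueScopedRemoved, freshSurvived, keyOnlyRemovedFresh);
+    }
+}
+```
+
+`Lazy<T>` does not override `Equals`, so the value-scoped overload compares by
+reference, which is the behavior the fix relies on.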
+
+**Resolution:** _(empty until closed; on close, record the fixing commit SHA, the date, and a one-line description of the fix)_
+
+### Core.Scripting-015
+
+| Field | Value |
+|---|---|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Location | `ScriptEvaluator.cs:234-270` (`ToCSharpTypeName`) |
+| Status | Open |
+
+**Description:** `ToCSharpTypeName` is documented to handle nested types
+(`Outer+Inner` → `Outer.Inner`). The non-generic path (line 269) does this via
+`Replace('+', '.')`. The generic path (lines 263-266) builds the name from
+`def.FullName!` and then takes the substring up to the first backtick. For a
+**nested generic** type whose parent is not generic, e.g. `Outer.Inner<T>` with
+`FullName` ``Outer+Inner`1``, `Replace('+', '.')` is applied first and
+``Substring(0, IndexOf('`'))`` on ``"Outer.Inner`1"`` produces `"Outer.Inner"`,
+which is correct.
+
+However, the generic branch does NOT handle the case where the OPEN generic
+type itself is nested with `+` inside the parent's name when the parent is
+also generic (`Outer<TOuter>.Inner<TInner>` — `FullName` is
+`Outer`1+Inner`1[[TOuter,TInner]]`). For that shape `Substring(0, IndexOf('`'))`
+truncates at the first backtick — yielding `"Outer.Inner"` — silently dropping
+the closed type arguments of `Outer<TOuter>`. The resulting source string is
+syntactically valid but semantically wrong: `global::Outer.Inner<TInner>` does
+not name `Outer<TOuter>.Inner<TInner>`.
+
+The production code never hits this shape — `TResult` is always one of
+`object?`, `bool`, `int`, `double`, `string?`, `DateTime` across the
+virtual-tag engine, the alarm engine, the test-harness, and the test suite,
+and `ScriptGlobals<TContext>` is always a top-level generic over a top-level
+`ScriptContext` subclass. The bug is latent. But it is a foot-gun for a
+future caller (e.g. a Phase-8 driver that wires a context type defined as a
+nested generic for grouping reasons) and the XML-doc comment claims
+"handles nested types" without qualifying it.
+
+A second smaller correctness gap on the same path: the comment claims
+`global::`-qualified FQNs prevent accidental capture by the wrapper's `using`
+directives, which is true for the generic / non-generic branches, but the
+primitive aliases (`bool`, `int`, `string`, `object`, …) are emitted unqualified.
+A script that defines a local `class bool` (now possible per Core.Scripting-013)
+would shadow the alias. Probably benign, but worth a comment.
+
+**Recommendation:** Add a check in the generic branch that walks the FullName
+backtick-by-backtick — or use `INamedTypeSymbol`-style name composition from
+`def.DeclaringType` recursively — so multi-arity-nested generics emit
+correctly. At minimum update the XML doc to qualify "handles nested types" as
+"handles single-level nesting; nested generics whose parent is itself generic
+are not supported". Add a `ToCSharpTypeName` unit test (currently nothing
+exercises this method directly — coverage relies on the end-to-end compile path,
+so the bug surfaces only as a misleading Roslyn diagnostic).
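+
+One possible shape for the `DeclaringType`-walking alternative is sketched
+below using plain reflection. It is an assumption-laden illustration, not the
+project's method: the `TypeNameSketch` class and the `Outer<TOuter>.Inner<TInner>`
+demo types are hypothetical, primitive aliases are not special-cased, and
+arrays, pointers, and by-ref types are not handled.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+using System.Text;
+
+public class Outer<TOuter> { public class Inner<TInner> { } }
+
+public static class TypeNameSketch
+{
+    public static string ToCSharpTypeName(Type t)
+    {
+        if (t.IsGenericParameter)
+            return t.Name;
+
+        // For a nested generic, this array holds the parent's arguments first,
+        // followed by the nested type's own arguments.
+        Type[] args = t.GetGenericArguments();
+        int used = 0;
+
+        // Build the declaring chain from outermost to innermost.
+        var chain = new List<Type>();
+        for (Type? cur = t; cur != null; cur = cur.DeclaringType)
+            chain.Insert(0, cur);
+
+        var sb = new StringBuilder("global::");
+        if (!string.IsNullOrEmpty(t.Namespace))
+            sb.Append(t.Namespace).Append('.');
+
+        for (int i = 0; i < chain.Count; i++)
+        {
+            if (i > 0) sb.Append('.');
+
+            // Each level's Name carries only its own arity suffix, e.g. "Inner`1".
+            string name = chain[i].Name;
+            int tick = name.IndexOf('`');
+            int arity = 0;
+            if (tick >= 0)
+            {
+                arity = int.Parse(name.Substring(tick + 1));
+                name = name.Substring(0, tick);
+            }
+
+            sb.Append(name);
+            if (arity > 0)
+            {
+                sb.Append('<')
+                  .Append(string.Join(", ", args.Skip(used).Take(arity).Select(ToCSharpTypeName)))
+                  .Append('>');
+                used += arity;
+            }
+        }
+        return sb.ToString();
+    }
+}
+```
+
+With the demo types declared in the global namespace,
+`ToCSharpTypeName(typeof(Outer<int>.Inner<string>))` yields
+`global::Outer<global::System.Int32>.Inner<global::System.String>`, keeping the
+parent's type argument attached to the parent.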
+
+**Resolution:** _(empty until closed; on close, record the fixing commit SHA, the date, and a one-line description of the fix)_
+
+### Core.Scripting-016
+
+| Field | Value |
+|---|---|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:74-117`, `src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/ScriptedAlarmEngine.cs:139-182` |
+| Status | Open |
+
+**Description:** The Core.Scripting-008 resolution made `ScriptEvaluator`
+implement `IDisposable` and added a `CompiledScriptCache.Clear()` that disposes
+each materialised evaluator before dropping its dictionary entry, so per-publish
+ALC accretion is no longer process-lifetime rooted **inside the cache**. But
+neither production consumer of `ScriptEvaluator` uses the cache: both
+`VirtualTagEngine.Load` and `ScriptedAlarmEngine.LoadAsync` call
+`ScriptEvaluator<TContext, TResult>.Compile(...)` directly (lines 105 / 160
+respectively), store the evaluator inside an internal `VirtualTagState` /
+`AlarmState` record, and on the next `Load` simply call `_tags.Clear()` /
+`_alarms.Clear()`. The dropped `ScriptEvaluator` references never have
+`Dispose()` called on them, so the underlying `ScriptAssemblyLoadContext`
+instances are never `Unload()`-ed. The .NET runtime guarantees that a
+collectible ALC stays alive until `Unload()` is called explicitly — having
+"no strong references" is necessary but not sufficient. So the publish-replace
+cycle leaks every prior generation's emitted assembly exactly as before the
+fix, even though the fix's infrastructure is in place.
+
+The Core.Scripting-008 regression tests in `CompiledScriptCacheTests`
+(`Dispose_unloads_compiled_script_assembly_load_context` /
+`Clear_disposes_every_materialised_evaluator`) prove the contract on
+`CompiledScriptCache`, but neither engine uses that class. There is no
+integration test exercising the actual publish path — i.e. that calling
+`VirtualTagEngine.Load(...)` twice with different definitions makes the prior
+generation's ALC eligible for GC. As a result the fix's headline guarantee
+("Server restarts are no longer required to reclaim compiled-script memory" —
+`docs/VirtualTags.md`) is not actually delivered to the production engines.
+
+This is the same observable behavior the original Core.Scripting-008 finding
+described, surfacing on a different code path that the resolution did not touch.
+
+**Recommendation:** Either route the engines' compile path through
+`CompiledScriptCache<TContext, TResult>` (the documented design — the cache
+already returns the same evaluator instance for identical source, and its
+`Clear()` now performs the right disposal — and `Script.SourceHash`'s doc-comment
+already names this as the cache key), or make the engines' `Load` methods
+dispose the previous `ScriptEvaluator` instances before reassigning. The
+former is the cleaner change because it also collapses redundant compiles
+across publishes for unchanged scripts. Add an integration test along the
+lines of `CompiledScriptCacheTests.Clear_disposes_every_materialised_evaluator`
+for each engine: snapshot the per-evaluator emitted assembly via
+`WeakReference`, call `Load(...)` with a different definition set, and assert
+the prior generation's assemblies become collectable.
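+
+The second option (disposing the previous generation inside `Load`) might be
+sketched as follows. `FakeEvaluator` and `EngineSketch` are hypothetical
+stand-ins; the real engines' state records and `Load` signatures are not shown
+in this finding, so only the dispose-before-clear ordering is the point here.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+// Hypothetical stand-in for a compiled evaluator that owns a collectible ALC.
+public sealed class FakeEvaluator : IDisposable
+{
+    public bool Disposed { get; private set; }
+    public void Dispose() => Disposed = true;
+}
+
+// Hypothetical engine shape holding one evaluator per tag name.
+public sealed class EngineSketch
+{
+    private readonly Dictionary<string, FakeEvaluator> _tags = new();
+
+    public void Load(IEnumerable<string> tagNames, Func<string, FakeEvaluator> compile)
+    {
+        // Dispose the previous generation before dropping the references,
+        // instead of calling _tags.Clear() alone.
+        foreach (var previous in _tags.Values)
+            previous.Dispose();
+        _tags.Clear();
+
+        foreach (var name in tagNames)
+            _tags[name] = compile(name);
+    }
+}
+```
+
+This sketch ignores in-flight evaluations and compile failures partway through
+a load; routing through the cache, as the recommendation prefers, would also
+avoid recompiling unchanged scripts.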
+
+**Resolution:** _(empty until closed; on close, record the fixing commit SHA, the date, and a one-line description of the fix)_