code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including
first-time reviews of the four newer components (AuditLog, NotificationOutbox,
SiteCallAudit, Transport) — so the code-reviews/ index reflects today's
codebase rather than the 2026-05-16 baseline. 172 new Open findings (0
Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules.

regen-readme.py now derives each module's Last reviewed + Commit from its
findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future
single-module re-reviews show their own date in the Module Status table.
Joseph Doherty
2026-05-28 02:55:47 -04:00
parent 1eb6e972b0
commit f93b7b99bb
25 changed files with 8793 additions and 115 deletions
+407 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.SiteRuntime` |
| Design doc | `docs/requirements/Component-SiteRuntime.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |
## Summary
@@ -47,6 +47,36 @@ and two dead lifecycle handlers in `InstanceActor` that the Deployment Manager
never routes to (SiteRuntime-019, Low). All three were subsequently resolved on
2026-05-17. Open findings: 0.
#### Re-review 2026-05-28 (commit `1eb6e97`)
The module was re-reviewed at commit `1eb6e97` as part of the new baseline
review. The SiteRuntime source surface has grown materially since the prior
pass — primarily by threading `ExecutionId`/`ParentExecutionId`/`SourceNode`
through the script-trust-boundary helpers and the cached-call telemetry
emitters, and by adding `OperationTrackingStore`, the
`AuditingDbConnection`/`AuditingDbCommand`/`AuditingDbDataReader` decorators,
and `ScriptExecutionScheduler`. All 10 checklist categories were walked afresh.
Seven new findings were recorded: a race that throws
`InvalidActorNameException` when a second `DeployInstanceCommand` arrives for
the same instance while a redeployment is still terminating its predecessor
(SiteRuntime-020, Medium); an artifact-only data-connection update that never
reaches the DCL (SiteRuntime-021, Medium); `AuditingDbCommand.DbConnection.set`
reaching into `AuditingDbConnection._inner` via reflection — the same anti-
pattern SiteRuntime-006 eliminated for the repositories, now reintroduced and
in direct tension with the script trust model that forbids `System.Reflection`
(SiteRuntime-022, Medium); `Convert.ToDouble(value)` in `ScriptActor` /
`AlarmActor` running under `CurrentCulture` so a string attribute value
becomes locale-sensitive (SiteRuntime-023, Low); `OperationTrackingStore`
serialising every cached-call write through a single connection +
`SemaphoreSlim` and using sync-over-async in `Dispose()` (SiteRuntime-024,
Medium); inbound-API `SetAttribute` (and any future caller) accepting unknown
attribute names and persisting them as overrides, polluting both `_attributes`
and the SQLite override table (SiteRuntime-025, Low); and the
`ReplicationMessages.cs` outbound/inbound record types still missing public XML
docs (SiteRuntime-026, Low). Prior findings 001-019 remain
Resolved/Deferred — no regressions observed in any of their fixed call sites.
Open findings: 7.
## Checklist coverage
| # | Category | Examined | Notes |
@@ -62,6 +92,21 @@ never routes to (SiteRuntime-019, Low). All three were subsequently resolved on
| 9 | Testing coverage | ✓ | No tests for ScriptExecutionActor, AlarmExecutionActor, SiteReplicationActor, or the two repositories. |
| 10 | Documentation & comments | ✓ | Several XML comments describe behaviour the code does not implement (see findings). |
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Second-deploy race vs. pending redeploy (020); artifact-only data-connection update never reaches DCL (021); unknown-name SetAttribute persists bogus overrides (025). |
| 2 | Akka.NET conventions | ✓ | Trigger-eval blocking on coordinator mailbox remains Deferred (014); short-lived execution actors and replication actor otherwise conform. |
| 3 | Concurrency & thread safety | ✓ | DM's `_instanceActors` cache and `_pendingRedeploys` map no longer exhibit the old race (SiteRuntime-003), but a new ordering race surfaced (020). `OperationTrackingStore` single-connection + SemaphoreSlim serialises all cached-call reads and writes (024). |
| 4 | Error handling & resilience | ✓ | `Task.Run` fire-and-forget replication paths log on faulted (acceptable, per "best-effort replication" design). DM's deploy persistence rollback path (resolved as SiteRuntime-005) intact. |
| 5 | Security | ✓ | Trust-model semantic analysis (SiteRuntime-011 fix) intact. `AuditingDbCommand` reflects into `AuditingDbConnection._inner` — same anti-pattern as SiteRuntime-006 (022). Audit emitter captures SQL parameter values verbatim per M4 design (M5 will redact). |
| 6 | Performance & resource management | ✓ | Per-call SQLite connections on hot paths in `SiteStorageService` (existing pattern, acceptable). `OperationTrackingStore` `Dispose()` does sync-over-async (024). `ScriptExecutionScheduler` bounded threads as expected. |
| 7 | Design-document adherence | ✓ | Artifact-only data-connection update path is silently inert (021) — contradicts the "site is self-contained after artifact deployment" intent. |
| 8 | Code organization & conventions | ✓ | Repository reflection-via-private-field anti-pattern reintroduced in `AuditingDbCommand` (022). `ReplicationMessages.cs` public records still undocumented (026). |
| 9 | Testing coverage | ✓ | `SiteReplicationActor` remains uncovered (SiteRuntime-016 deferred that gap to a clustered-ActorSystem harness, still outstanding). New findings have no targeted coverage yet. |
| 10 | Documentation & comments | ✓ | `ReplicationMessages.cs` records lack XML docs (026); other XML doc surface materially expanded in `1eb6e97`. |
## Findings
### SiteRuntime-001 — `Instance.SetAttribute` never writes to the Data Connection Layer
@@ -902,3 +947,362 @@ stating the Deployment Manager owns this lifecycle. Regression test:
`InstanceActorTests.InstanceActor_DoesNotHandleDisableOrEnableCommands` asserts the
Instance Actor produces no `InstanceLifecycleResponse` for either command
(confirmed to fail against the pre-fix dead handlers and pass after removal).
### SiteRuntime-020 — Second `DeployInstanceCommand` arriving during a pending redeploy races the still-terminating actor on its name
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:285`, `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:971` |
**Description**
The SiteRuntime-003 fix makes `HandleDeploy` watch + stop a running Instance
Actor and buffer the in-flight `DeployInstanceCommand` in `_pendingRedeploys`
until `Terminated` arrives. The handler also removes the instance from
`_instanceActors` synchronously, in step with the stop request:
```csharp
if (_instanceActors.TryGetValue(instanceName, out var existing))
{
    _instanceActors.Remove(instanceName);
    _pendingRedeploys[existing] = new PendingRedeploy(command, Sender);
    Context.Watch(existing);
    Context.Stop(existing);
    UpdateInstanceCounts();
    return;
}
// Fresh deployment — no existing actor to replace.
ApplyDeployment(command, Sender, isRedeploy: false);
```
If a *second* `DeployInstanceCommand` for the same `instanceName` arrives on
the singleton's mailbox while the predecessor is still terminating, the
`_instanceActors.TryGetValue` lookup correctly reports "no existing actor" —
because the first deploy already removed it — and execution falls through to
`ApplyDeployment(..., isRedeploy: false)`. `ApplyDeployment` immediately calls
`CreateInstanceActor`, which calls `Context.ActorOf(props, instanceName)`. But
the predecessor's Akka child name **is still registered** in the parent's
child registry: that name is only released after the predecessor's `Terminated`
signal — exactly the asynchronous gap SiteRuntime-003 was created to plug for
the *first* redeploy. `Context.ActorOf` therefore throws
`InvalidActorNameException`, which Akka rethrows as
`ActorInitializationException` — and the supervisor's `Stop` directive on that
exception (DeploymentManagerActor.cs:179) silently stops the just-created
child. The second deploy is then quietly lost: `_instanceActors` doesn't
contain it (the throw aborted the bookkeeping after `CreateInstanceActor`'s
own `ContainsKey` guard but before `_instanceActors[instanceName] = actorRef`
would have run), `_totalDeployedCount` was incremented, and the deployer is
never told the deployment failed (the persistence `Task.Run` is also dropped
on the throw path). The race is real on a busy site where central retries a
deploy because the prior attempt timed out — exactly the scenario the
DeploymentManager-006 query-then-deploy idempotency mechanism was designed for.
The first-redeploy case (SiteRuntime-003) does NOT exhibit this because at
that point the predecessor was still in `_instanceActors`, so the branch
correctly buffers. The bug is specific to any further deploy that arrives
while a redeploy is already buffered and its predecessor is still terminating.
**Recommendation**
The pending-redeploy bookkeeping needs to be authoritative for "we are mid-
redeploy on this instance", not just the `_instanceActors` cache. Add a second
keyed lookup — e.g. a `Dictionary<string, IActorRef> _terminatingActorsByName`
populated when the predecessor is stopped — and check it BEFORE
`ApplyDeployment(isRedeploy: false)`. On a hit, overwrite (or stash) the
buffered `PendingRedeploy` for that terminating actor so the latest command
wins on the `Terminated` signal. Alternatively, defer the deploy by stashing
all messages for that `instanceName` until the predecessor terminates (Akka
`Stash` pattern). Either way, the fall-through to "fresh deployment" needs to
be gated on "no instance with this name is currently terminating".
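The gating described above can be sketched without Akka as a small bookkeeping model. This is an illustration only: `RedeployBookkeeping`, `DeployRoute`, and the string command ids are hypothetical stand-ins, not types from `DeploymentManagerActor`, and the real fix would hold `IActorRef`s and `PendingRedeploy` values instead.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical routing outcomes for an incoming deploy command.
public enum DeployRoute { FreshDeployment, BufferedBehindRunningActor, ReplacedBufferedRedeploy }

// Akka-free model: tracks which instance names are running and which are
// mid-termination, so a late deploy never falls through to "fresh deployment".
public sealed class RedeployBookkeeping
{
    private readonly HashSet<string> _running = new();

    // instance name -> id of the latest command buffered behind the terminating actor
    private readonly Dictionary<string, string> _terminatingByName = new();

    public DeployRoute OnDeploy(string instanceName, string commandId)
    {
        // Checked FIRST: the name is still reserved by a terminating predecessor.
        if (_terminatingByName.ContainsKey(instanceName))
        {
            _terminatingByName[instanceName] = commandId; // latest command wins
            return DeployRoute.ReplacedBufferedRedeploy;
        }

        // A running actor exists: stop it and buffer the command.
        if (_running.Remove(instanceName))
        {
            _terminatingByName[instanceName] = commandId;
            return DeployRoute.BufferedBehindRunningActor;
        }

        _running.Add(instanceName);
        return DeployRoute.FreshDeployment;
    }

    // Models the Terminated signal: the name is free again, so the buffered
    // command (if any) can be applied. Returns its id, or null if none was buffered.
    public string? OnTerminated(string instanceName)
    {
        if (!_terminatingByName.Remove(instanceName, out var commandId))
            return null;

        _running.Add(instanceName);
        return commandId;
    }
}
```

In this model a deploy that arrives while the name is still terminating overwrites the buffered command rather than attempting to create a child under a name that has not yet been released.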
### SiteRuntime-021 — `HandleDeployArtifacts` updates `DataConnections` in SQLite but never sends `CreateConnectionCommand` to the DCL
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:931` |
**Description**
`HandleDeployArtifacts` persists the artifact bundle (shared scripts, external
systems, database connections, notification lists, SMTP configs, and
**data connection definitions**) into local SQLite. For data connection
definitions specifically (`DataConnections`), the handler calls
`_storage.StoreDataConnectionDefinitionAsync(...)` — but does NOT issue a
`CreateConnectionCommand` (or any other DCL command) to the `_dclManager`
actor. The only path that pushes DCL configuration to the DCL is
`EnsureDclConnections`, called exclusively from the deploy / startup-batch
paths against the **flattened instance configuration's** inline `Connections`
dictionary. There is no equivalent for an artifact-only update.
Concretely: an artifact deployment that changes a data connection's endpoint
URL, credentials, backup endpoint, or failover retry count is stored
durably in the site SQLite (so on the *next* node restart the site loads the
new config and `EnsureDclConnections` picks it up) but is silently inert until
either an instance using that connection is redeployed or the node restarts.
This contradicts the design's "after artifact deployment, the site is fully
self-contained" intent (Component-SiteRuntime.md, "System-Wide Artifact
Handling") — the runtime DCL keeps using the stale connection until a much
heavier trigger event occurs. It is also asymmetric with how
`SharedScripts` are handled in the same method: shared scripts are both
stored *and* recompiled into `_sharedScriptLibrary` on update so the change is
live immediately.
(SiteRuntime-010 fixed a related defect inside `EnsureDclConnections` — the
config-hash cache — but that's only consulted on the inline-config path; the
artifact-deployment path never reaches `EnsureDclConnections`.)
**Recommendation**
In the `DataConnections` branch of `HandleDeployArtifacts`, after the
`StoreDataConnectionDefinitionAsync` call, also send a
`CreateConnectionCommand` to `_dclManager` for each updated definition,
re-using the SiteRuntime-010 config hash so unchanged connections are skipped.
Alternatively, refactor `EnsureDclConnections` to accept a flat list of
`(name, protocol, configurationJson, backupConfigurationJson,
failoverRetryCount)` tuples that both the inline (`FlattenedConfiguration`)
and artifact paths can drive through it.
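A minimal sketch of the second option, assuming a flat definition record and a per-name hash cache. `ConnectionDefinition`, `DclConnectionSync`, and the hash inputs are hypothetical; what the SiteRuntime-010 config hash actually covers is not stated here, and the `send` delegate stands in for telling `_dclManager` a `CreateConnectionCommand`.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

// Hypothetical flat shape that both the inline and artifact paths could produce.
public sealed record ConnectionDefinition(
    string Name,
    string Protocol,
    string ConfigurationJson,
    string? BackupConfigurationJson,
    int FailoverRetryCount);

public sealed class DclConnectionSync
{
    private readonly Dictionary<string, string> _lastSentHash = new();
    private readonly Action<ConnectionDefinition> _send;

    // 'send' is a stand-in for pushing the definition to the DCL manager actor.
    public DclConnectionSync(Action<ConnectionDefinition> send) => _send = send;

    // Sends only definitions whose content changed since the last send;
    // returns how many were sent.
    public int Ensure(IEnumerable<ConnectionDefinition> definitions)
    {
        var sent = 0;
        foreach (var definition in definitions)
        {
            var hash = Hash(definition);
            if (_lastSentHash.TryGetValue(definition.Name, out var previous) && previous == hash)
                continue; // unchanged: skip

            _send(definition);
            _lastSentHash[definition.Name] = hash;
            sent++;
        }
        return sent;
    }

    private static string Hash(ConnectionDefinition d)
    {
        // Unit separator keeps field boundaries unambiguous in the hashed payload.
        var payload = string.Join(
            "\u001f",
            d.Protocol,
            d.ConfigurationJson,
            d.BackupConfigurationJson ?? string.Empty,
            d.FailoverRetryCount.ToString(CultureInfo.InvariantCulture));
        return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(payload)));
    }
}
```

With this shape the artifact handler and the deploy path would both call `Ensure`, so an endpoint or credential change in an artifact bundle reaches the DCL immediately while unchanged definitions cost only a hash comparison.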
### SiteRuntime-022 — `AuditingDbCommand.DbConnection.set` uses reflection to read `AuditingDbConnection._inner`
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Scripts/AuditingDbCommand.cs:138` |
**Description**
The `DbConnection` setter on `AuditingDbCommand` unwraps an
`AuditingDbConnection` value by reading its private `_inner` field via
reflection:
```csharp
set
{
    _wrappingConnection = value;
    _inner.Connection = value switch
    {
        AuditingDbConnection auditing => auditing.GetType()
            .GetField("_inner", BindingFlags.Instance | BindingFlags.NonPublic)
            !.GetValue(auditing) as DbConnection,
        _ => value
    };
}
```
This is the same encapsulation-violating anti-pattern that SiteRuntime-006
called out for the site repositories. A rename or refactor of
`AuditingDbConnection._inner` breaks the audit decorator at runtime (no
compile-time signal), the `!.` null-forgiving operator suppresses the nullable
warning on the `GetField` result (so a failed lookup surfaces only as a
`NullReferenceException`), and the reflective access trips static analyzers
and IL trimming. More
problematically, the script trust model the same module enforces in
`ScriptCompilationService.ValidateTrustModel` explicitly forbids
`System.Reflection` in scripts — yet the auditing helper a script ends up
running through itself reaches via reflection into a sibling class. Both
classes are `internal sealed` in the same assembly, so this is purely a
self-imposed contract violation.
A second smaller concern in the same property: the getter returns
`_wrappingConnection ?? _inner.Connection`. If the caller obtains a command
via `AuditingDbConnection.CreateDbCommand()` and immediately reads
`cmd.Connection`, the getter returns the raw inner connection (not the
auditing wrapper), because `_wrappingConnection` is only populated when the
setter is later invoked. That's surprising and at odds with the class's
audit-everything intent — a script that round-trips a command through
`cmd.Connection` re-enters the un-audited path.
**Recommendation**
Expose the wrapped connection through a proper API surface. The simplest fix
that matches the SiteRuntime-006 precedent: add an
`internal DbConnection Inner { get; }` property to `AuditingDbConnection`
(both classes are `internal sealed`, so the property stays out of the public
surface) and replace the reflection switch with `auditing.Inner`. While
touching the property, also have the getter return `_wrappingConnection` even
on the synthesised CreateDbCommand path (e.g. set `_wrappingConnection` to
the parent connection inside `AuditingDbConnection.CreateDbCommand`).
### SiteRuntime-023 — `Convert.ToDouble(value)` in trigger and alarm evaluation is locale-sensitive
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:446`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:340`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:356`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:444` |
**Description**
`ScriptActor.EvaluateCondition` and the three `AlarmActor` evaluators
(`EvaluateRangeViolation`, `EvaluateRateOfChange`, `EvaluateHiLo`) call
`Convert.ToDouble(value)` without specifying a culture. When `value` is a
string (a path that exists today — attribute values that arrive as JSON-
deserialized numbers can still surface as strings on some code paths,
particularly array values that are JSON-stringified at
`InstanceActor.HandleTagValueUpdate:377`), `Convert.ToDouble` parses against
`CultureInfo.CurrentCulture`. On a host whose locale uses a comma decimal
separator (German, French, most of continental Europe), `"1.5"` is not read as
one and a half: where `.` is the group separator (e.g. de-DE) it silently
parses as `15`, and where `.` is not a valid separator (e.g. fr-FR) it throws.
In the throwing case the condition / alarm silently degrades to its
catch-fallthrough (returns `false` for range/rate-of-change, keeps current
level for HiLo, falls back to string-compare for conditionals). The CLAUDE.md
"All timestamps are UTC" discipline is the equivalent rule for time; there is
no equivalent invariant-culture discipline applied to numeric parsing.
The exposure is bounded — most attribute values arrive as numeric primitives
from `TagValueUpdate.Value` or static `FlattenedConfiguration.Attributes`
(also typed), so `Convert.ToDouble` takes its culture-independent numeric
conversion path. But the
string path is reachable via inbound API writes
(`RouteToSetAttributesRequest.AttributeValues` is `IReadOnlyDictionary<string,
string>`), via the JSON-array stringification at `HandleTagValueUpdate:377`,
and via static-override values loaded from SQLite (which are persisted as
strings — see `SetStaticOverrideAsync`).
**Recommendation**
Replace each `Convert.ToDouble(value)` with `Convert.ToDouble(value,
CultureInfo.InvariantCulture)`, or front-load a typed-numeric extraction
helper (`if (value is double d) return d; if (value is string s && double.TryParse(s,
NumberStyles.Float, CultureInfo.InvariantCulture, out var p)) return p;
return Convert.ToDouble(value, CultureInfo.InvariantCulture);`). The site is a
deterministic machine-control surface; condition evaluation must not depend
on the host's regional settings.
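The typed-numeric extraction helper could look roughly like the following. `NumericCoercion.ToInvariantDouble` is a hypothetical name, not an existing member of the module; the body follows the inline snippet in the recommendation above.

```csharp
using System;
using System.Globalization;

// Hypothetical helper; name and placement are illustrative only.
public static class NumericCoercion
{
    public static double ToInvariantDouble(object value)
    {
        switch (value)
        {
            case double d:
                return d; // already typed, nothing to parse

            case string s when double.TryParse(
                    s, NumberStyles.Float, CultureInfo.InvariantCulture, out var parsed):
                // NumberStyles.Float omits AllowThousands, so "1,500" is rejected
                // here rather than being read with a group separator.
                return parsed;

            default:
                // Other boxed numerics convert directly; a string that failed
                // TryParse above throws FormatException here; null converts to 0,
                // matching Convert.ToDouble(object).
                return Convert.ToDouble(value, CultureInfo.InvariantCulture);
        }
    }
}
```

Because the helper never consults `CultureInfo.CurrentCulture`, `"1.5"` evaluates to one and a half regardless of the host's regional settings, and the existing catch-fallthrough behaviour still applies to genuinely non-numeric strings.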
### SiteRuntime-024 — `OperationTrackingStore` serialises all writes through one connection + `SemaphoreSlim`, and `Dispose()` does sync-over-async
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:39`, `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:360` |
**Description**
`OperationTrackingStore` owns exactly one `SqliteConnection` and gates every
public method through a single `SemaphoreSlim(1, 1)`. The class XML comment
calls this out as deliberate ("the M3 brief calls out as 'cleaner than the M2
Channel<T> pipeline given the volume'"), and the *write* volume is genuinely
low — at most a handful of lifecycle rows per cached call. But on a busy site
the *read* path (`GetStatusAsync`) is called by every `Tracking.Status(id)`
invocation from every executing script, and reads are serialised through the
same gate as writes. A long-running write (e.g. a Roslyn-script-driven
`RecordTerminalAsync` competing with an SQLite checkpoint) holds the gate and
stalls every concurrent status query. SQLite supports concurrent readers with
a single writer in WAL mode; the gate forfeits that capability.
A separate concern in the same class: `Dispose()` calls
`DisposeAsyncCore().AsTask().GetAwaiter().GetResult()`. That is sync-over-
async — the very pattern SiteRuntime-008 was a finding for. If a caller
disposes the store from a synchronization context that does not allow
re-entrance (e.g. an `IHostedService.StopAsync` continuation observed on the
host's sync context, or a finalizer pumping on the thread pool with a stuck
continuation), the `.WaitAsync()` inside `DisposeAsyncCore` waits for a
continuation that will never run, and the dispose deadlocks. The async path
itself is correct; only the sync `Dispose()` wrapper is risky.
**Recommendation**
For the single-connection gate: split reads and writes into separate gates,
or — better — keep the writer single-connection and open a fresh read
connection (or pool of read connections) per `GetStatusAsync` call. SQLite
connections are cheap; the `SiteStorageService` precedent already uses per-
call connections on the read path. For `Dispose()`: prefer
`Dispose() { GC.SuppressFinalize(this); _connection.Dispose(); _gate.Dispose(); }`
without an awaited disposal, and have the `IAsyncDisposable.DisposeAsync`
path do the awaiting. If a synchronous disposable is genuinely needed, do
not bridge it through the async core — duplicate the dispose-once flag check
into a sync path that calls `_connection.Dispose()` directly.
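A sketch of the disposal shape recommended above, assuming a dispose-once flag shared by both paths. `TrackingStoreSketch` is hypothetical, and the `IDisposable` field stands in for the store's `SqliteConnection`; the sketch also assumes callers do not dispose while operations are still in flight, since the sync path does not take the gate.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for OperationTrackingStore's disposal logic only.
public sealed class TrackingStoreSketch : IDisposable, IAsyncDisposable
{
    private readonly IDisposable _connection;           // stand-in for the SQLite connection
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private int _disposed;                               // 0 = live, 1 = disposed

    public TrackingStoreSketch(IDisposable connection) => _connection = connection;

    // Sync path: never waits on async work, so it cannot block on a
    // continuation that is unable to run.
    public void Dispose()
    {
        if (Interlocked.Exchange(ref _disposed, 1) != 0)
            return; // already disposed by either path

        _connection.Dispose();
        _gate.Dispose();
    }

    // Async path: the only place that awaits, draining the gate first so an
    // in-flight operation finishes before the connection goes away.
    public async ValueTask DisposeAsync()
    {
        if (Interlocked.Exchange(ref _disposed, 1) != 0)
            return;

        await _gate.WaitAsync().ConfigureAwait(false);
        try
        {
            if (_connection is IAsyncDisposable asyncConnection)
                await asyncConnection.DisposeAsync().ConfigureAwait(false);
            else
                _connection.Dispose();
        }
        finally
        {
            _gate.Release();
            _gate.Dispose();
        }
    }
}
```

The trade-off is that the sync path gives up the orderly drain the async path provides; whether that is acceptable depends on who actually calls `Dispose()` on the store.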
### SiteRuntime-025 — `HandleSetStaticAttribute` persists unknown attribute names as static overrides
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:223`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:246` |
**Description**
`HandleSetStaticAttribute` resolves the target attribute against
`_configuration.Attributes` to decide whether to route the write to the DCL or
treat it as a static-override write. If the lookup fails (`resolved == null`),
`isDataSourced` is false, and execution falls through to
`HandleSetStaticAttributeCore` — which unconditionally:
1. inserts the bogus key into the in-memory `_attributes` dictionary,
2. publishes an `AttributeValueChanged` for the bogus key to the site stream
and to every child Script/Alarm actor,
3. persists a row in `static_attribute_overrides` for the bogus key, and
4. replies `Success = true` to the caller.
Concretely, an inbound API `Route.To().SetAttribute("notARealAttr", "x")`
returns success, pollutes the in-memory state with a key that no script can
legitimately observe (canonical-name lookup will not produce it), persists a
durable SQLite override row that survives restart, and (on every restart)
re-injects the polluting key via `HandleOverridesLoaded` at line 608. Nothing
removes the bogus row short of a full redeploy: `ClearStaticOverridesAsync`
clears by `instance_unique_name`, so the row is eventually cleaned, but in the
meantime each restart resurrects it. The publish-to-stream
side effect also lets a hostile or buggy inbound caller spam debug-view
subscribers with synthetic attribute changes.
Worth flagging at Low: the inbound API surface is already authenticated and
the design assumes its callers are trusted. But the no-validation behaviour
contradicts the design doc's "Scripts can only read/write attributes on their
own instance" framing — an inbound API call inherits the same instance-scope
authority as a script, and the script trust model wouldn't sanction this.
**Recommendation**
In `HandleSetStaticAttribute`, when `resolved == null`, reply
`SetStaticAttributeResponse(Success: false,
ErrorMessage: $"Attribute '{command.AttributeName}' not found on instance
'{_instanceUniqueName}'")` instead of falling through to the override path.
Optionally also surface the existence check on the `RouteInboundApiSetAttributes`
fan-out so a multi-attribute write reports the offending key without rolling
back the others (the per-attribute `Ask` shape already supports a partial
failure response).
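The existence check can be modelled in isolation as below. `AttributeOverrideGate` and `SetResult` are hypothetical stand-ins for the relevant part of `InstanceActor` and for `SetStaticAttributeResponse`; the error text mirrors the message suggested above.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in for SetStaticAttributeResponse.
public sealed record SetResult(bool Success, string? ErrorMessage);

// Models only the validation step: unknown names are rejected before any
// in-memory update, publish, or persistence would happen.
public sealed class AttributeOverrideGate
{
    private readonly HashSet<string> _configuredNames;
    private readonly Dictionary<string, string> _overrides = new();

    public AttributeOverrideGate(IEnumerable<string> configuredNames)
        => _configuredNames = new HashSet<string>(configuredNames, StringComparer.Ordinal);

    public IReadOnlyDictionary<string, string> Overrides => _overrides;

    public SetResult TrySet(string instanceUniqueName, string attributeName, string value)
    {
        if (!_configuredNames.Contains(attributeName))
        {
            // Fail fast: no override row, no state change.
            return new SetResult(
                false,
                $"Attribute '{attributeName}' not found on instance '{instanceUniqueName}'");
        }

        _overrides[attributeName] = value;
        return new SetResult(true, null);
    }
}
```

A multi-attribute write built on this shape can collect the per-attribute results and report the offending key while the valid writes still succeed.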
### SiteRuntime-026 — `ReplicationMessages.cs` public record types have no XML documentation
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:10`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:13`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:15`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:17`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:19`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:25`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:28`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:30`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:32`, `src/ScadaLink.SiteRuntime/Messages/ReplicationMessages.cs:34` |
**Description**
The ten public record types in `ReplicationMessages.cs`
(`ReplicateConfigDeploy`, `ReplicateConfigRemove`, `ReplicateConfigSetEnabled`,
`ReplicateArtifacts`, `ReplicateStoreAndForward`, `ApplyConfigDeploy`,
`ApplyConfigRemove`, `ApplyConfigSetEnabled`, `ApplyArtifacts`,
`ApplyStoreAndForward`) carry no XML documentation. The file header comment
groups them as "outbound" vs "inbound" but the individual records have no
`<summary>` and no parameter docs. XML docs were rolled out across the rest
of the module in `1eb6e97` (the commit being reviewed is literally `docs:
add XML doc comments across src + Sister Projects section in CLAUDE.md`), so
this file is now the conspicuous outlier — and the `CommentChecker` skill
relied on by the `fixdocs` workflow will flag every record as missing docs.
**Recommendation**
Add a `<summary>` per record naming the direction (outbound → peer / inbound
from peer) and what the operation replicates, and `<param>` docs for each
record parameter. Mirror the precedent in
`src/ScadaLink.Commons/Messages/.../*.cs`. While there, consider sealing the
inbound vs outbound split with a marker base type (currently they're just
named conventionally) so `Receive<ReplicateXxx>` vs `Receive<ApplyXxx>` is
expressed at the type level — but that's optional and out of scope for a
docs-only finding.