fix(site-runtime): resolve SiteRuntime-004..011 — deploy-after-persist, remove reflection, deterministic IDs, non-blocking startup, dedicated script scheduler, config-change detection, semantic trust-model check

Joseph Doherty
2026-05-16 21:44:10 -04:00
parent 24a4a2d165
commit a88bec9376
17 changed files with 1112 additions and 150 deletions

@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
-| Open findings | 13 |
+| Open findings | 5 |
## Summary
@@ -176,10 +176,10 @@ longer drifts (this additionally addresses the root cause behind SiteRuntime-004
| | |
|--|--|
-| Severity | Medium |
+| Severity | Medium — re-triaged: already fixed by the SiteRuntime-003 resolution. |
| Category | Correctness & logic bugs |
-| Status | Open |
-| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:239` |
+| Status | Resolved |
+| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs` (`ApplyDeployment`) |
**Description**
@@ -193,16 +193,24 @@ grow, but the in-memory `_totalDeployedCount` (reported to the health collector
`UpdateInstanceCounts`) drifts upward and the reported "disabled" count becomes
wrong.
-**Recommendation**
+**Re-triage (2026-05-16)**
-Only increment `_totalDeployedCount` when the instance is genuinely new. Either
-track whether this deploy replaced an existing config, or derive the deployed count
-from storage / the union of running actors and disabled configs rather than
-maintaining a hand-incremented counter.
+Verified against the current source: this is **already fixed**. The SiteRuntime-003
+resolution replaced the fixed-delay reschedule with a shared `ApplyDeployment` helper
+that takes an `isRedeploy` flag and guards the counter with `if (!isRedeploy)
+_totalDeployedCount++;`. The redeploy path (`HandleTerminated`) always calls
+`ApplyDeployment(..., isRedeploy: true)`, so the counter is no longer bumped on
+redeployment. The regression test
+`DeploymentManagerRedeployTests.Redeploy_ExistingInstance_DoesNotOverCountDeployedInstances`
+already covers this and passes. No further code change was required.
**Resolution**
-_Unresolved._
+Resolved 2026-05-16 (`commit pending`): no new change needed — the root cause was
+eliminated by the SiteRuntime-003 fix (the `isRedeploy` guard in `ApplyDeployment`).
+Confirmed by the existing passing regression test
+`Redeploy_ExistingInstance_DoesNotOverCountDeployedInstances`. Re-triaged from Open to
+Resolved.
### SiteRuntime-005 — Deployment reports `Success` to central before persistence completes
@@ -210,8 +218,8 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
-| Status | Open |
-| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:272` |
+| Status | Resolved |
+| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs` (`ApplyDeployment`, `HandleDeployPersistenceResult`) |
**Description**
@@ -232,7 +240,16 @@ At minimum, do not report `Success` until the config row is committed.
**Resolution**
-_Unresolved._
+Resolved 2026-05-16 (`commit pending`): root cause confirmed — `ApplyDeployment` sent
+`DeploymentStatusResponse(Success)` synchronously before the persistence `Task.Run`
+completed. The `Success` reply is now sent from `HandleDeployPersistenceResult` only
+once the persistence result is known: on success it replies `Success`; on a
+persistence failure it logs the error, stops the optimistically-created Instance
+Actor, rolls back the deployed-instance counter, and replies
+`DeploymentStatus.Failed` with the error message. `DeployPersistenceResult` carries an
+`IsRedeploy` flag so the counter rollback is skipped for redeployments. Regression
+tests: `DeploymentManagerMediumFindingsTests.Deploy_PersistenceFailure_ReportsFailedNotSuccess`
+and `Deploy_Success_ReportsSuccessAndPersistsConfig`.
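Reviewer's illustration (not part of the commit): the reply-after-persist flow and the `isRedeploy` counter guard described above can be sketched without Akka.NET. Every type and member name ending in `Sketch`, the `stopInstance` callback, and the tuple return shape are assumptions made for illustration; the real actor replies via messages and is not reproduced here.

```csharp
using System;

public enum DeploymentStatusSketch { Success, Failed }

// Hypothetical stand-in for the DeployPersistenceResult message.
public sealed record DeployPersistenceResultSketch(
    string InstanceName, bool IsRedeploy, bool Succeeded, string ErrorMessage);

public sealed class DeploymentTrackerSketch
{
    public int TotalDeployedCount { get; private set; }

    // Optimistic step: count the instance only when it is genuinely new.
    // No reply is produced here; the outcome is not known yet.
    public void ApplyDeployment(bool isRedeploy)
    {
        if (!isRedeploy) TotalDeployedCount++;
    }

    // Runs once persistence has finished; this is the only place a status is produced.
    public (DeploymentStatusSketch Status, string Message) HandleDeployPersistenceResult(
        DeployPersistenceResultSketch result, Action<string> stopInstance)
    {
        if (result.Succeeded)
            return (DeploymentStatusSketch.Success, string.Empty);

        // Undo the optimistic work before reporting the failure.
        stopInstance(result.InstanceName);
        if (!result.IsRedeploy) TotalDeployedCount--; // a redeploy never bumped the counter
        return (DeploymentStatusSketch.Failed, result.ErrorMessage);
    }
}
```

In this sketch a failed first-time deploy leaves the counter where it started, while a failed redeploy leaves it untouched, which is the behaviour the resolution text attributes to the `IsRedeploy` flag.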
### SiteRuntime-006 — Site-local repositories read `SiteStorageService` private field via reflection
@@ -240,8 +257,8 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
-| Status | Open |
-| Location | `src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs:183`, `src/ScadaLink.SiteRuntime/Repositories/SiteNotificationRepository.cs:181` |
+| Status | Resolved |
+| Location | `src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs`, `src/ScadaLink.SiteRuntime/Repositories/SiteNotificationRepository.cs` |
**Description**
@@ -263,7 +280,16 @@ repositories. Remove the reflection entirely.
**Resolution**
-_Unresolved._
+Resolved 2026-05-16 (`commit pending`): root cause confirmed — both repositories
+reflected into `SiteStorageService._connectionString`. `SiteStorageService` now
+exposes a public `CreateConnection()` factory method that returns an unopened
+`SqliteConnection` against the site database. Both `SiteExternalSystemRepository` and
+`SiteNotificationRepository` now obtain connections via `_storage.CreateConnection()`;
+all reflection (`Type.GetField` / `BindingFlags`) and the contradictory XML comments
+have been removed. This is a fully in-module refactor — no cross-module design
+decision was needed. Regression test:
+`SiteRepositoryTests.ExternalSystemRepository_RoundTripsStoredDefinition` exercises
+the repository's connection path end-to-end.
### SiteRuntime-007 — Synthetic entity IDs use the non-deterministic `string.GetHashCode()`
@@ -271,8 +297,8 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
-| Status | Open |
-| Location | `src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs:241`, `src/ScadaLink.SiteRuntime/Repositories/SiteNotificationRepository.cs:254` |
+| Status | Resolved |
+| Location | `src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs`, `src/ScadaLink.SiteRuntime/Repositories/SiteNotificationRepository.cs`, `src/ScadaLink.SiteRuntime/Repositories/SyntheticId.cs` |
**Description**
@@ -294,7 +320,18 @@ rather than synthesising integer IDs.
**Resolution**
-_Unresolved._
+Resolved 2026-05-16 (`commit pending`): root cause confirmed — both repositories used
+`name.GetHashCode()`, which is per-process randomized on .NET Core. A new internal
+`SyntheticId` helper computes a deterministic, process-stable 31-bit ID using the
+FNV-1a hash over the name's UTF-8 bytes. Both `GenerateSyntheticId` methods now
+delegate to `SyntheticId.From(name)`. (The integer-keyed lookups are kept because
+they are mandated by the shared `IExternalSystemRepository`/`INotificationRepository`
+contracts in Commons — changing those contracts to name-keyed would be a cross-module
+change outside this module's scope; the deterministic hash resolves the correctness
+defect within scope.) Regression tests:
+`SiteRepositoryTests.ExternalSystemRepository_SyntheticId_IsStableAcrossRestart` and
+`NotificationRepository_SyntheticId_IsStableAcrossRestart` re-create the service to
+simulate a process restart and confirm by-ID lookups still resolve.
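Reviewer's illustration (not part of the commit): a minimal sketch of a process-stable 31-bit ID derived with 32-bit FNV-1a over UTF-8 bytes, as the resolution describes. The class name `SyntheticIdSketch` is hypothetical, and masking off the top bit is an assumption about how the real `SyntheticId.From` reaches 31 bits; the committed helper may differ.

```csharp
using System;
using System.Text;

public static class SyntheticIdSketch
{
    private const uint FnvOffsetBasis = 2166136261; // standard 32-bit FNV offset basis
    private const uint FnvPrime = 16777619;         // standard 32-bit FNV prime

    public static int From(string name)
    {
        if (name is null) throw new ArgumentNullException(nameof(name));

        uint hash = FnvOffsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(name))
        {
            hash ^= b;                          // FNV-1a: XOR the byte in first...
            hash = unchecked(hash * FnvPrime);  // ...then multiply, wrapping modulo 2^32
        }

        // Assumption: clear the sign bit so the ID is a non-negative 31-bit int.
        return (int)(hash & 0x7FFFFFFF);
    }
}
```

Unlike `string.GetHashCode()`, this depends only on the bytes of the name, so the same name yields the same ID in every process. A 31-bit space can still collide for distinct names, which the sketch does not attempt to detect.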
### SiteRuntime-008 — Blocking `.GetAwaiter().GetResult()` on the actor thread during startup
@@ -302,8 +339,8 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
-| Status | Open |
-| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:479` |
+| Status | Resolved |
+| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs` (`HandleStartupConfigsLoaded`, `LoadSharedScriptsFromStorage`, `HandleSharedScriptsLoaded`) |
**Description**
@@ -327,7 +364,18 @@ back.
**Resolution**
-_Unresolved._
+Resolved 2026-05-16 (`commit pending`): root cause confirmed — the blocking
+`.GetAwaiter().GetResult()` and Roslyn compilation ran on the singleton's mailbox
+thread inside `HandleStartupConfigsLoaded`. `LoadSharedScriptsFromStorage` now runs
+the SQLite read **and** the Roslyn compilation on a background `Task.Run` and pipes a
+new internal `SharedScriptsLoaded` message back to the actor. A new
+`HandleSharedScriptsLoaded` handler then begins staggered Instance Actor creation, so
+the compilation→creation ordering is preserved without ever blocking the mailbox. A
+shared-script load failure is logged and startup proceeds (scripts needing a missing
+shared script fail at execution time). Regression test:
+`DeploymentManagerMediumFindingsTests.Startup_WithSharedScripts_LoadsConfigsAndStaysResponsive`
+(confirms startup completes and the actor stays responsive with shared scripts
+present).
### SiteRuntime-009 — Script execution actors run scripts on the default thread pool, not a dedicated dispatcher
@@ -335,8 +383,8 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
-| Status | Open |
-| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptExecutionActor.cs:72`, `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:289`, `src/ScadaLink.SiteRuntime/Actors/AlarmExecutionActor.cs:57` |
+| Status | Resolved |
+| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptExecutionActor.cs`, `src/ScadaLink.SiteRuntime/Actors/AlarmExecutionActor.cs`, `src/ScadaLink.SiteRuntime/Scripts/ScriptExecutionScheduler.cs` |
**Description**
@@ -359,7 +407,19 @@ way, remove the "in production, configure…" comments by actually configuring i
**Resolution**
-_Unresolved._
+Resolved 2026-05-16 (`commit pending`): root cause confirmed — script and alarm
+on-trigger bodies ran inside a bare `Task.Run` on the shared `ThreadPool`. The
+recommendation's `TaskScheduler` option was taken because it is fully in-module (a
+HOCON dispatcher would require editing the Host's ActorSystem config, out of scope).
+A new `ScriptExecutionScheduler` provides a bounded set of dedicated background
+threads (count from the new `SiteRuntimeOptions.ScriptExecutionThreadCount`, default
+8). `ScriptExecutionActor` and `AlarmExecutionActor` now run their bodies via
+`Task.Factory.StartNew(..., ScriptExecutionScheduler.Shared(options)).Unwrap()`
+instead of `Task.Run`, so blocking script I/O is contained to those dedicated threads
+and cannot starve the global pool. The misleading "in production, configure a
+dedicated dispatcher" comments were removed. Regression tests:
+`ScriptExecutionSchedulerTests` (`Scheduler_RunsWork_OffTheThreadPool`,
+`Scheduler_RespectsConfiguredThreadCount`, `Scheduler_Shared_ReturnsSameInstanceForOptions`).
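Reviewer's illustration (not part of the commit): one common way to build a `TaskScheduler` backed by a fixed set of dedicated threads. `DedicatedThreadSchedulerSketch` and `SchedulerDemo` are hypothetical names. The real `ScriptExecutionScheduler` is not shown in this review and may be built differently; this sketch omits its `Shared(options)` caching entirely.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class DedicatedThreadSchedulerSketch : TaskScheduler, IDisposable
{
    private readonly BlockingCollection<Task> _queue = new BlockingCollection<Task>();
    private readonly Thread[] _threads;

    public DedicatedThreadSchedulerSketch(int threadCount)
    {
        if (threadCount < 1) throw new ArgumentOutOfRangeException(nameof(threadCount));

        _threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            // Each worker drains the shared queue until CompleteAdding is called.
            _threads[i] = new Thread(() =>
            {
                foreach (Task task in _queue.GetConsumingEnumerable())
                    TryExecuteTask(task);
            })
            { IsBackground = true, Name = $"script-exec-{i}" };
            _threads[i].Start();
        }
    }

    public override int MaximumConcurrencyLevel => _threads.Length;

    protected override void QueueTask(Task task) => _queue.Add(task);

    // Never run queued work inline on the caller's thread, so script bodies
    // stay confined to the dedicated threads.
    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued) => false;

    protected override IEnumerable<Task> GetScheduledTasks() => _queue.ToArray();

    // Lets the worker threads exit once the queue has drained.
    public void Dispose() => _queue.CompleteAdding();
}

public static class SchedulerDemo
{
    // Reports whether the delegate ran on a ThreadPool thread.
    public static Task<bool> RanOnThreadPoolAsync(TaskScheduler scheduler) =>
        Task.Factory.StartNew(
            () => Thread.CurrentThread.IsThreadPoolThread,
            CancellationToken.None,
            TaskCreationOptions.DenyChildAttach,
            scheduler);
}
```

With a scheduler like this, a script body that blocks on I/O ties up one of the dedicated threads rather than a `ThreadPool` worker, which is the containment property the resolution relies on.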
### SiteRuntime-010 — `EnsureDclConnections` never updates a connection whose configuration changed
@@ -367,8 +427,8 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
-| Status | Open |
-| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:413` |
+| Status | Resolved |
+| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs` (`EnsureDclConnections`, `ComputeConnectionConfigHash`) |
**Description**
@@ -390,7 +450,15 @@ the name.
**Resolution**
-_Unresolved._
+Resolved 2026-05-16 (`commit pending`): root cause confirmed — the cache was a
+name-only `HashSet`, so a changed connection config was silently dropped.
+`_createdConnections` is now a `Dictionary<string,string>` mapping connection name to
+a SHA-256 hash of its protocol/primary-config/backup-config/failover-retry-count
+(`ComputeConnectionConfigHash`). A connection whose hash is unchanged is still
+skipped; a connection whose config changed re-issues a `CreateConnectionCommand` so
+the DCL adopts the new configuration. Regression tests:
+`DeploymentManagerMediumFindingsTests.EnsureDclConnections_ConnectionConfigChanged_ReissuesCreateCommand`
+and `EnsureDclConnections_UnchangedConfig_DoesNotReissueCreateCommand`.
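Reviewer's illustration (not part of the commit): a sketch of change detection keyed by connection name, with a SHA-256 digest of the configuration as the cached value. The `ConnectionConfigSketch` record, the `NeedsCreate` method, and the length-prefixed canonical form are assumptions made for illustration; the commit's actual field types and serialization are not shown in this review.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

// Hypothetical shape of a connection definition.
public sealed record ConnectionConfigSketch(
    string Name, string Protocol, string PrimaryConfig, string BackupConfig, int FailoverRetryCount);

public sealed class ConnectionCacheSketch
{
    // Connection name -> hash of the configuration last sent to the DCL.
    private readonly Dictionary<string, string> _createdConnections = new Dictionary<string, string>();

    public static string ComputeConnectionConfigHash(ConnectionConfigSketch c)
    {
        string[] fields =
        {
            c.Protocol ?? string.Empty,
            c.PrimaryConfig ?? string.Empty,
            c.BackupConfig ?? string.Empty,
            c.FailoverRetryCount.ToString(CultureInfo.InvariantCulture),
        };

        // Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        var canonical = new StringBuilder();
        foreach (string field in fields)
            canonical.Append(field.Length).Append(':').Append(field);

        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(canonical.ToString()));
        return Convert.ToHexString(digest);
    }

    // Returns true when a create command should be issued (new or changed config).
    public bool NeedsCreate(ConnectionConfigSketch c)
    {
        string hash = ComputeConnectionConfigHash(c);
        if (_createdConnections.TryGetValue(c.Name, out string existing) && existing == hash)
            return false; // unchanged: skip, as before

        _createdConnections[c.Name] = hash;
        return true;
    }
}
```

Storing only the digest keeps the cache small while still catching any change to the hashed fields; the name is deliberately left out of the hash because it is already the dictionary key.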
### SiteRuntime-011 — Trust-model validation is a substring scan and is both over- and under-inclusive
@@ -398,8 +466,8 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Security |
-| Status | Open |
-| Location | `src/ScadaLink.SiteRuntime/Scripts/ScriptCompilationService.cs:52` |
+| Status | Resolved |
+| Location | `src/ScadaLink.SiteRuntime/Scripts/ScriptCompilationService.cs` (`ValidateTrustModel`) |
**Description**
@@ -430,7 +498,22 @@ unused `isAllowed` variable.
**Resolution**
-_Unresolved._
+Resolved 2026-05-16 (`commit pending`): root cause confirmed — `ValidateTrustModel`
+was a raw `string.Contains`/`IndexOf` scan of the source text, with a dead `isAllowed`
+variable. It is now Roslyn semantic analysis: the script is parsed and a
+`CSharpCompilation` + `SemanticModel` are built; every name/member/object-creation
+node is resolved to its symbol and the symbol's containing namespace and
+fully-qualified containing type are checked against the forbidden roots. Bare
+namespace symbols are ignored (so the `System.Threading` qualifier of the allowed
+`System.Threading.Tasks.Task` no longer false-positives). A name that cannot be
+resolved (a type from an assembly deliberately absent from the script's references)
+falls back to a syntactic fully-qualified-name check, so e.g. `System.Net.Http`
+references are still rejected. The dead `isAllowed` variable was removed. This fixes
+both the bypass (`global::`/alias-qualified forbidden types) and the false positives
+(forbidden namespace string in a comment, string literal, or unrelated identifier).
+Regression tests: new `TrustModelSemanticTests` (alias/`global::` detection, comment/
+literal/identifier non-detection, allowed-exception resolution); all 39 existing
+`SandboxTests` + `ScriptCompilationServiceTests` continue to pass.
### SiteRuntime-012 — `AttributeAccessor`/`ScopeAccessors` block the script on a synchronous Ask