fix(concurrency): close 8 race / thread-safety findings across CD, DCL, SR

CD-015: rewrite NotificationOutboxRepository.InsertIfNotExistsAsync as raw-SQL
IF NOT EXISTS … INSERT with SqlException 2601/2627 catch, ending the
at-least-once livelock on the site→central notification handoff.

DCL-018/019/020/021/022: add _subscribesInFlight guard so concurrent
same-tag subscribes don't orphan an adapter handle; delete the latent
dead _subscriptionHandles dictionary; stop double-counting
_totalSubscribed when an unresolved tag is promoted via another instance;
release adapter handles on mid-flight unsubscribe; gate the
tag-resolution retry timer with IsTimerActive so subscribe bursts don't
reset it into starvation.

SR-020: add _terminatingActorsByName shadow so a third deploy arriving
during a pending redeploy doesn't crash on InvalidActorNameException —
displaced senders get a Failed/superseded response and the latest
command wins on Terminated.

SR-024: split OperationTrackingStore reads from writes (fresh
SqliteConnection per GetStatusAsync) so long writes don't block status
queries; rewrite Dispose to drop the sync-over-async bridge that could
deadlock on a non-reentrant SyncContext; Interlocked.Exchange makes the
dispose-once flag race-safe across both paths.

Author: Joseph Doherty
Date: 2026-05-28 05:20:13 -04:00
Parent: 5d2386cc9d
Commit: f936f55f51

15 changed files with 1152 additions and 170 deletions
+30 -2
@@ -954,9 +954,22 @@ Instance Actor produces no `InstanceLifecycleResponse` for either command
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
-| Status | Open |
+| Status | Resolved |
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:285`, `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:971` |
**Resolution** — added a name → terminating-actor-ref shadow
(`_terminatingActorsByName`) populated when `HandleDeploy` stops the
predecessor and cleared in `HandleTerminated`. `HandleDeploy` now
detects the mid-termination state before falling through to
`ApplyDeployment(fresh)`: on hit it tells the displaced
`PendingRedeploy.OriginalSender` a `DeploymentStatus.Failed` /
"superseded by newer deployment …" response and overwrites the buffered
pending command (last-write-wins). Regression test
`SR020_ThreeRapidDeploys_DoNotThrowInvalidActorNameException_LatestWins`
fires three rapid deploys, asserts the middle deploy is told it was
superseded, the latest succeeds, and the resulting instance is operable
(DisableInstanceCommand works).
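
The last-write-wins buffering described in this resolution can be sketched without Akka as plain bookkeeping. This is a minimal illustration under stated assumptions, not the actual `DeploymentManagerActor` code: `RedeployBufferSketch`, `PendingDeploy`, and the string replies are hypothetical names, the name → actor-ref shadow and the pending-command buffer are collapsed into one dictionary, and actor stop/watch is not modeled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a buffered deploy command and its reply channel.
public sealed record PendingDeploy(string DeploymentId, Action<string> Reply);

public sealed class RedeployBufferSketch
{
    // Names whose predecessor was told to stop but has not terminated yet,
    // mapped to the command waiting to be applied once it does.
    private readonly Dictionary<string, PendingDeploy> _pendingByTerminatingName = new();
    private readonly HashSet<string> _running = new();

    // Deployment ids in the order they were actually applied (demo only).
    public List<string> Applied { get; } = new();

    public void HandleDeploy(string instanceName, PendingDeploy command)
    {
        if (_pendingByTerminatingName.TryGetValue(instanceName, out var displaced))
        {
            // A deploy arrived while a redeploy is already pending:
            // fail the displaced sender and keep only the newest command.
            displaced.Reply($"Failed: superseded by newer deployment {command.DeploymentId}");
            _pendingByTerminatingName[instanceName] = command;
            return;
        }

        if (_running.Remove(instanceName))
        {
            // Predecessor is running: stopping it is not modeled here;
            // buffer the command until termination is observed.
            _pendingByTerminatingName[instanceName] = command;
            return;
        }

        Apply(instanceName, command);
    }

    public void HandleTerminated(string instanceName)
    {
        // Clear the shadow entry and apply whichever command was buffered last.
        if (_pendingByTerminatingName.Remove(instanceName, out var latest))
            Apply(instanceName, latest);
    }

    private void Apply(string instanceName, PendingDeploy command)
    {
        _running.Add(instanceName);
        Applied.Add(command.DeploymentId);
        command.Reply("Success");
    }
}
```

Because the name stays reserved in the dictionary until termination is observed, a third deploy never reaches the fresh-apply path while the old name is still taken, which is the condition that would otherwise surface as `InvalidActorNameException`.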
**Description**
The SiteRuntime-003 fix makes `HandleDeploy` watch + stop a running Instance
@@ -1181,9 +1194,24 @@ on the host's regional settings.
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
-| Status | Open |
+| Status | Resolved |
| Location | `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:39`, `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:360` |
**Resolution** — split reads from writes: the single owned
`_writeConnection` + `_writeGate` still serialises writers, but
`GetStatusAsync` now opens a fresh `SqliteConnection` per call against
the shared connection string (mirroring `SiteStorageService`) so reads
never block on an in-flight write. Sync `Dispose` was rewritten to NOT
bridge to async — the dispose-once flag is an `int` flipped with
`Interlocked.Exchange`, the synchronous path disposes
`_writeConnection` + `_writeGate` directly without acquiring the gate,
and `DisposeAsync` retains the gate-drain semantics for graceful
shutdown. Both paths are idempotent; the second call short-circuits via
the interlocked flag. Tests:
`SR024_ConcurrentReads_DoNotBlockOnInFlightWrite`,
`SR024_SyncDispose_DoesNotDeadlock_WhenInvokedFromFreshThread`, and
`SR024_AsyncDispose_DoesNotDeadlock_AndIsIdempotent`.
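
The dispose-once flag described in this resolution follows a common pattern, sketched below. This is an illustration only, assuming a `SemaphoreSlim` gate: `DisposeOnceSketch` and `CleanupCount` are hypothetical names, and the connection disposal is omitted.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the store; only the dispose logic is sketched.
public sealed class DisposeOnceSketch : IDisposable, IAsyncDisposable
{
    private readonly SemaphoreSlim _writeGate = new(1, 1);
    private int _disposed; // 0 = live, 1 = disposed

    // Counts how many times cleanup actually ran (demo only).
    public int CleanupCount;

    public void Dispose()
    {
        // Exchange returns the previous value, so exactly one caller sees 0.
        if (Interlocked.Exchange(ref _disposed, 1) != 0) return;

        // Synchronous path: no gate acquisition and no blocking on async work.
        Interlocked.Increment(ref CleanupCount);
        _writeGate.Dispose();
    }

    public async ValueTask DisposeAsync()
    {
        // Same flag as Dispose, so the two paths cannot both run cleanup.
        if (Interlocked.Exchange(ref _disposed, 1) != 0) return;

        // Async path: take the gate first so in-flight writers can drain.
        await _writeGate.WaitAsync().ConfigureAwait(false);
        Interlocked.Increment(ref CleanupCount);
        _writeGate.Dispose();
    }
}
```

Sharing one interlocked flag between both paths is what makes a second call a no-op, whichever path it arrives on; a plain `bool` check-then-set would leave a window in which two threads both pass the check.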
**Description**
`OperationTrackingStore` owns exactly one `SqliteConnection` and gates every