fix(concurrency): close 8 race / thread-safety findings across CD, DCL, SR

CD-015: rewrite NotificationOutboxRepository.InsertIfNotExistsAsync as raw-SQL
IF NOT EXISTS … INSERT with SqlException 2601/2627 catch, ending the
at-least-once livelock on the site→central notification handoff.

DCL-018/019/020/021/022: add _subscribesInFlight guard so concurrent
same-tag subscribes don't orphan an adapter handle; delete the latent
dead _subscriptionHandles dictionary; stop double-counting
_totalSubscribed when an unresolved tag is promoted via another instance;
release adapter handles on mid-flight unsubscribe; gate the
tag-resolution retry timer with IsTimerActive so subscribe bursts don't
reset it into starvation.

SR-020: add _terminatingActorsByName shadow so a third deploy arriving
during a pending redeploy doesn't crash on InvalidActorNameException —
displaced senders get a Failed/superseded response and the latest
command wins on Terminated.

SR-024: split OperationTrackingStore reads from writes (fresh
SqliteConnection per GetStatusAsync) so long writes don't block status
queries; rewrite Dispose to drop the sync-over-async bridge that could
deadlock on a non-reentrant SyncContext; Interlocked.Exchange makes the
dispose-once flag race-safe across both paths.
Joseph Doherty
2026-05-28 05:20:13 -04:00
parent 5d2386cc9d
commit f936f55f51
15 changed files with 1152 additions and 170 deletions
+62 -5
@@ -952,9 +952,22 @@ pre-fix code (the batch throws, no map returned) and passes after;
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:557,564-594,653` |
**Resolution** — added a `_subscribesInFlight` HashSet mirroring the
existing `_resolutionInFlight` pattern. `HandleSubscribe` now partitions
each request's tags on the actor thread into "this request will
SubscribeAsync" vs. "already subscribed by us OR by another in-flight
request"; the second arrival sees the tag in `_subscribesInFlight` and
treats it as `AlreadySubscribed: true` without issuing a duplicate
adapter call. `HandleSubscribeCompleted` removes each
non-AlreadySubscribed result from the set; `ReSubscribeAll` clears it on
reconnect. Regression test
`DCL018_ConcurrentSubscribes_SameTag_DifferentInstances_IssueOneAdapterSubscribe`
parks the first subscribe in flight, fires a second for the same tag,
and asserts exactly one adapter SubscribeAsync call.
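
The partitioning step described above can be sketched in isolation. This is a minimal illustration, not the actual `DataConnectionActor` code: the `SubscribeGuard` class, its `Partition` / `Complete` / `ClearInFlight` methods, and the use of a `HashSet<string>` in place of the real `_subscriptionIds` collection are all hypothetical stand-ins. It relies on being called from a single thread, as actor message handlers are.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the actor's bookkeeping. Field names follow the
// resolution text; everything else is an assumption made for illustration.
public sealed class SubscribeGuard
{
    // Tags that already have a live adapter subscription (simplified to a set).
    private readonly HashSet<string> _subscriptionIds = new();
    // Tags for which a SubscribeAsync call has been issued but not completed.
    private readonly HashSet<string> _subscribesInFlight = new();

    // Splits a request's tags into "this request subscribes" and
    // "already subscribed or already in flight". Single-threaded by design.
    public (List<string> ToSubscribe, List<string> AlreadyCovered) Partition(IEnumerable<string> tags)
    {
        var toSubscribe = new List<string>();
        var alreadyCovered = new List<string>();
        foreach (var tag in tags)
        {
            // HashSet.Add returns false when the tag is already in flight.
            if (_subscriptionIds.Contains(tag) || !_subscribesInFlight.Add(tag))
                alreadyCovered.Add(tag);
            else
                toSubscribe.Add(tag);
        }
        return (toSubscribe, alreadyCovered);
    }

    // Called when the adapter call for a tag finishes.
    public void Complete(string tag, bool succeeded)
    {
        _subscribesInFlight.Remove(tag);
        if (succeeded) _subscriptionIds.Add(tag);
    }

    // Called on reconnect, before everything is re-subscribed.
    public void ClearInFlight() => _subscribesInFlight.Clear();
}

public static class Program
{
    public static void Main()
    {
        var guard = new SubscribeGuard();
        var first = guard.Partition(new[] { "Tag.A" });   // first request owns the adapter call
        var second = guard.Partition(new[] { "Tag.A" });  // arrives while the first is in flight
        // prints "1 0 1"
        Console.WriteLine($"{first.ToSubscribe.Count} {second.ToSubscribe.Count} {second.AlreadyCovered.Count}");
    }
}
```

The second request issues no adapter call because the tag is claimed at partition time rather than at completion time, which is what closes the window between the snapshot and the async result.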
**Description**
`HandleSubscribe` snapshots `_subscriptionIds.Keys` into a local `alreadySubscribed`
@@ -1004,9 +1017,19 @@ exactly one `_adapter.SubscribeAsync(tag, ...)` call (and no orphan subscription
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:31,167,177`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:163-164` |
**Resolution** — deleted the dead `_subscriptionHandles` field outright.
Subscription bookkeeping lives in `RealOpcUaClient._monitoredItems` /
`_callbacks` (already `ConcurrentDictionary` per DCL-003) at the device
layer and `DataConnectionActor._subscriptionIds` at the actor layer;
the adapter had no live reader and the field was a latent
race-condition trap. Added structural regression test
`DCL019_OpcUaDataConnection_HasNoNonConcurrentSharedDictionary` that
reflects over the adapter's fields and fails if any plain
`Dictionary<,>` is reintroduced.
**Description**
`OpcUaDataConnection._subscriptionHandles` is declared as `Dictionary<string,
@@ -1061,9 +1084,20 @@ state.
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:653-661,670-688` |
**Resolution** — split the success branch into "fresh subscribe" vs
"unresolved → resolved promotion": `_unresolvedTags.Remove(...)` is now
called before incrementing the counters; a promotion increments only
`_resolvedTags` and clears `_resolutionInFlight`, mirroring
`HandleTagResolutionSucceeded`. The symmetric failure branch was also
fixed — `_totalSubscribed` only increments when
`_unresolvedTags.Add(...)` returns `true` so a second instance failing
to resolve the same tag is a no-op on the counter. Tests:
`DCL020_UnresolvedTagPromoted_ByDifferentInstance_DoesNotDoubleCountTotalSubscribed`
and `DCL020_TwoInstancesFailingSameTag_OnlyCountsTagOnceInTotal`.
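
The counter rules above can be sketched as a small state machine. This is an assumed simplification: the `TagCounters` class and its method names are hypothetical, `_resolutionInFlight` is omitted, and `AlreadySubscribed` results are assumed to be filtered out before these methods run.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reduction of the counters described in the resolution.
public sealed class TagCounters
{
    private readonly HashSet<string> _unresolvedTags = new();

    public int TotalSubscribed { get; private set; }
    public int ResolvedTags { get; private set; }

    public void OnSubscribeSucceeded(string tag)
    {
        // Remove first: a true return means the tag was previously counted as
        // unresolved, so this is a promotion and the total must not move.
        if (_unresolvedTags.Remove(tag))
        {
            ResolvedTags++;
        }
        else
        {
            TotalSubscribed++;
            ResolvedTags++;
        }
    }

    public void OnSubscribeFailed(string tag)
    {
        // Add returns false when another instance already failed on this tag,
        // in which case the tag is already part of the total.
        if (_unresolvedTags.Add(tag))
            TotalSubscribed++;
    }
}

public static class Program
{
    public static void Main()
    {
        var counters = new TagCounters();
        counters.OnSubscribeFailed("Tag.A");     // instance 1 fails to resolve
        counters.OnSubscribeFailed("Tag.A");     // instance 2 fails on the same tag
        counters.OnSubscribeSucceeded("Tag.A");  // a later subscribe promotes it
        // prints "1 1"
        Console.WriteLine($"{counters.TotalSubscribed} {counters.ResolvedTags}");
    }
}
```

Using the return values of `Remove` and `Add` as the branch condition keeps the check and the mutation in one step, so the counters cannot drift from the set's contents.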
**Description**
`HandleSubscribeCompleted`'s success branch (line 656-661) writes
@@ -1115,9 +1149,22 @@ test that asserts `_totalSubscribed` / `_resolvedTags` consistency after the
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:626-634,642-687` |
**Resolution** — `HandleSubscribeCompleted` now detects the
mid-termination race: when `_subscriptionsByInstance.TryGetValue` fails,
the method logs a warning, clears each owned tag from
`_subscribesInFlight`, issues a fire-and-forget
`_adapter.UnsubscribeAsync(result.SubscriptionId!)` for every successful
non-AlreadySubscribed result (so the OPC UA monitored items don't keep
streaming for a tag nobody listens to), and returns without applying
counter or handle mutations. Regression test
`DCL021_UnsubscribeDuringInFlightSubscribe_ReleasesAdapterHandle_AndKeepsStateClean`
parks the subscribe, sends `UnsubscribeTagsRequest`, then releases the
subscribe and asserts `_adapter.UnsubscribeAsync` is called and
`TotalSubscribedTags == 0`.
**Description**
`HandleSubscribe` dispatches a `Task.Run` that performs adapter I/O off the actor
@@ -1170,9 +1217,19 @@ for A while the subscribe I/O is in flight, completes the subscribe, and asserts
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:691-698,991-998` |
**Resolution** — both call sites now gate
`Timers.StartPeriodicTimer("tag-resolution-retry", ...)` with
`!Timers.IsTimerActive("tag-resolution-retry")` so the first failure
arms the timer and subsequent failures only pile onto `_unresolvedTags`.
Regression test
`DCL022_BurstedFailedSubscribes_DoNotResetRetryTimer` fires 5 failed
subscribes within one retry interval and asserts the retry timer
actually fires within one interval of the first failure (pre-fix it
would have been pushed past the interval boundary by the resets).
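
A minimal sketch of the gating pattern, assuming an Akka.NET actor that implements `IWithTimers`. The actor name, message types, and the 10-second interval are hypothetical; only the timer key and the `IsTimerActive` / `StartPeriodicTimer` pairing come from the resolution text. It needs the Akka NuGet package to compile.

```csharp
using System;
using System.Collections.Generic;
using Akka.Actor;

// Hypothetical actor showing only the timer gate; the real retry logic is elided.
public sealed class RetryTimerSketch : ReceiveActor, IWithTimers
{
    private const string RetryTimerKey = "tag-resolution-retry";
    // Assumed interval; the real value is not stated in the finding.
    private static readonly TimeSpan RetryInterval = TimeSpan.FromSeconds(10);

    public sealed record ResolutionFailed(string Tag);
    private sealed record RetryTick;

    private readonly HashSet<string> _unresolvedTags = new();

    // Populated by Akka.NET before the first message is processed.
    public ITimerScheduler Timers { get; set; } = null!;

    public RetryTimerSketch()
    {
        Receive<ResolutionFailed>(msg =>
        {
            _unresolvedTags.Add(msg.Tag);

            // Starting a timer under an existing key replaces the old timer,
            // which restarts the interval. Gating on IsTimerActive means only
            // the first failure arms it.
            if (!Timers.IsTimerActive(RetryTimerKey))
                Timers.StartPeriodicTimer(RetryTimerKey, new RetryTick(), RetryInterval);
        });

        Receive<RetryTick>(_ =>
        {
            // Retry of the unresolved tags would go here (elided).
            if (_unresolvedTags.Count == 0)
                Timers.Cancel(RetryTimerKey);
        });
    }
}
```

Without the gate, a failure arriving every few seconds would keep replacing the timer before it could fire, which is the starvation the finding describes.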
**Description**
`HandleSubscribeCompleted` (line 691-698) and `HandleTagResolutionFailed` (line