fix(concurrency): close 8 race / thread-safety findings across CD, DCL, SR

CD-015: rewrite NotificationOutboxRepository.InsertIfNotExistsAsync as raw-SQL IF NOT EXISTS … INSERT with SqlException 2601/2627 catch, ending the at-least-once livelock on the site→central notification handoff.

DCL-018/019/020/021/022: add _subscribesInFlight guard so concurrent same-tag subscribes don't orphan an adapter handle; delete the latent dead _subscriptionHandles dictionary; stop double-counting _totalSubscribed when an unresolved tag is promoted via another instance; release adapter handles on mid-flight unsubscribe; gate the tag-resolution retry timer with IsTimerActive so subscribe bursts don't reset it into starvation.

SR-020: add _terminatingActorsByName shadow so a third deploy arriving during a pending redeploy doesn't crash on InvalidActorNameException; displaced senders get a Failed/superseded response and the latest command wins on Terminated.

SR-024: split OperationTrackingStore reads from writes (fresh SqliteConnection per GetStatusAsync) so long writes don't block status queries; rewrite Dispose to drop the sync-over-async bridge that could deadlock on a non-reentrant SyncContext; Interlocked.Exchange makes the dispose-once flag race-safe across both paths.
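
Reviewer note on SR-024: the dispose-once flag mentioned in the message can be sketched as below. This is a minimal illustration, not the code from `OperationTrackingStore`; the class name `OnceDisposable`, the `ReleaseCount` counter, and the assumption that "both paths" means `Dispose` and `DisposeAsync` are all hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical type: shows only the dispose-once flag, not the real store.
public sealed class OnceDisposable : IDisposable, IAsyncDisposable
{
    private int _disposed; // 0 = live, 1 = disposed

    // Counts how many times cleanup actually ran (for demonstration only).
    public int ReleaseCount { get; private set; }

    public void Dispose()
    {
        // Exchange atomically sets the flag and returns the previous value,
        // so exactly one caller across both paths observes 0.
        if (Interlocked.Exchange(ref _disposed, 1) != 0) return;
        ReleaseCount++; // synchronous cleanup; nothing here blocks on async work
    }

    public ValueTask DisposeAsync()
    {
        if (Interlocked.Exchange(ref _disposed, 1) != 0) return default;
        ReleaseCount++; // asynchronous cleanup would be awaited here
        return default;
    }
}
```

Because both methods race on the same integer flag, a `Dispose` that overlaps a `DisposeAsync` cannot run the cleanup twice, and neither path has to wait on the other.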
@@ -952,9 +952,22 @@ pre-fix code (the batch throws, no map returned) and passes after;
 |--|--|
 | Severity | High |
 | Category | Concurrency & thread safety |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:557,564-594,653` |

+**Resolution** — added a `_subscribesInFlight` HashSet mirroring the
+existing `_resolutionInFlight` pattern. `HandleSubscribe` now partitions
+each request's tags on the actor thread into "this request will
+SubscribeAsync" vs. "already subscribed by us OR by another in-flight
+request"; the second arrival sees the tag in `_subscribesInFlight` and
+treats it as `AlreadySubscribed: true` without issuing a duplicate
+adapter call. `HandleSubscribeCompleted` removes each
+non-AlreadySubscribed result from the set; `ReSubscribeAll` clears it on
+reconnect. Regression test
+`DCL018_ConcurrentSubscribes_SameTag_DifferentInstances_IssueOneAdapterSubscribe`
+parks the first subscribe in flight, fires a second for the same tag,
+and asserts exactly one adapter SubscribeAsync call.
+
 **Description**

 `HandleSubscribe` snapshots `_subscriptionIds.Keys` into a local `alreadySubscribed`
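
Reviewer note on DCL-018: the partitioning step described in the resolution above can be sketched roughly as follows. Only the field names `_subscriptionIds` and `_subscribesInFlight` come from the resolution text; the `SubscribeTracker` class and its method names are hypothetical, and the sketch assumes all calls happen on a single thread, as they would inside an actor.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the actor's subscribe bookkeeping.
public sealed class SubscribeTracker
{
    private readonly Dictionary<string, string> _subscriptionIds = new();
    private readonly HashSet<string> _subscribesInFlight = new();

    // Returns the tags this request must actually send to the adapter.
    // Tags already subscribed, or claimed by an earlier in-flight request,
    // are skipped (the caller would report them as AlreadySubscribed).
    public List<string> BeginSubscribe(IEnumerable<string> tags)
    {
        var toSubscribe = new List<string>();
        foreach (var tag in tags)
        {
            if (_subscriptionIds.ContainsKey(tag)) continue;
            // HashSet.Add returns false when the tag is already in flight.
            if (_subscribesInFlight.Add(tag)) toSubscribe.Add(tag);
        }
        return toSubscribe;
    }

    // Called when the adapter call for a tag finishes.
    public void CompleteSubscribe(string tag, string subscriptionId, bool succeeded)
    {
        _subscribesInFlight.Remove(tag);
        if (succeeded) _subscriptionIds[tag] = subscriptionId;
    }

    // On reconnect no earlier subscribe is in flight any more.
    public void ClearInFlight() => _subscribesInFlight.Clear();
}
```

A second `BeginSubscribe` for a tag that is still in flight returns an empty list, which is the property the regression test checks: exactly one adapter call per tag.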
@@ -1004,9 +1017,19 @@ exactly one `_adapter.SubscribeAsync(tag, ...)` call (and no orphan subscription
 |--|--|
 | Severity | Medium |
 | Category | Concurrency & thread safety |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:31,167,177`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:163-164` |

+**Resolution** — deleted the dead `_subscriptionHandles` field outright.
+Subscription bookkeeping lives in `RealOpcUaClient._monitoredItems` /
+`_callbacks` (already `ConcurrentDictionary` per DCL-003) at the device
+layer and `DataConnectionActor._subscriptionIds` at the actor layer;
+the adapter had no live reader and the field was a latent
+race-condition trap. Added structural regression test
+`DCL019_OpcUaDataConnection_HasNoNonConcurrentSharedDictionary` that
+reflects over the adapter's fields and fails if any plain
+`Dictionary<,>` is reintroduced.
+
 **Description**

 `OpcUaDataConnection._subscriptionHandles` is declared as `Dictionary<string,
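
Reviewer note on DCL-019: a structural check of the kind the resolution describes might look like the sketch below. It is not the actual `DCL019_...` test; `FieldAudit`, `SafeAdapter`, and `UnsafeAdapter` are hypothetical names used only to exercise the reflection logic.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

public static class FieldAudit
{
    // Returns the names of fields declared on `type` whose declared type
    // is exactly a plain (non-concurrent) Dictionary<,>.
    public static List<string> PlainDictionaryFields(Type type)
    {
        const BindingFlags flags = BindingFlags.Instance | BindingFlags.Static |
                                   BindingFlags.Public | BindingFlags.NonPublic |
                                   BindingFlags.DeclaredOnly;
        return type.GetFields(flags)
            .Where(f => f.FieldType.IsGenericType &&
                        f.FieldType.GetGenericTypeDefinition() == typeof(Dictionary<,>))
            .Select(f => f.Name)
            .ToList();
    }
}

// Hypothetical sample types, present only so the check has something to inspect.
public sealed class SafeAdapter
{
    private readonly ConcurrentDictionary<string, string> _callbacks = new();
}

public sealed class UnsafeAdapter
{
    private readonly Dictionary<string, string> _subscriptionHandles = new();
}
```

One limitation of this approach: a field declared as `IDictionary<,>` but assigned a plain `Dictionary<,>` would pass, because the check inspects the declared field type rather than the runtime instance.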
@@ -1061,9 +1084,20 @@ state.
 |--|--|
 | Severity | Medium |
 | Category | Correctness & logic bugs |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:653-661,670-688` |

+**Resolution** — split the success branch into "fresh subscribe" vs
+"unresolved → resolved promotion": `_unresolvedTags.Remove(...)` is now
+called before incrementing the counters; a promotion increments only
+`_resolvedTags` and clears `_resolutionInFlight`, mirroring
+`HandleTagResolutionSucceeded`. The symmetric failure branch was also
+fixed — `_totalSubscribed` only increments when
+`_unresolvedTags.Add(...)` returns `true` so a second instance failing
+to resolve the same tag is a no-op on the counter. Tests:
+`DCL020_UnresolvedTagPromoted_ByDifferentInstance_DoesNotDoubleCountTotalSubscribed`
+and `DCL020_TwoInstancesFailingSameTag_OnlyCountsTagOnceInTotal`.
+
 **Description**

 `HandleSubscribeCompleted`'s success branch (line 656-661) writes
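
Reviewer note on DCL-020: the counter rules in the resolution above can be sketched as follows. The `TagCounters` class and its method names are hypothetical; the sketch assumes the caller invokes these methods only for results that are not `AlreadySubscribed`, and it leaves out the `_resolutionInFlight` bookkeeping.

```csharp
using System.Collections.Generic;

// Hypothetical model of the two counters and the unresolved-tag set.
public sealed class TagCounters
{
    private readonly HashSet<string> _unresolvedTags = new();

    public int TotalSubscribed { get; private set; }
    public int ResolvedTags { get; private set; }

    public void OnSubscribeSucceeded(string tag)
    {
        if (_unresolvedTags.Remove(tag))
        {
            // Promotion: the tag was counted in TotalSubscribed when it
            // first failed, so only the resolved counter moves.
            ResolvedTags++;
        }
        else
        {
            // Fresh subscribe: the tag is new to both counters.
            TotalSubscribed++;
            ResolvedTags++;
        }
    }

    public void OnSubscribeFailed(string tag)
    {
        // HashSet.Add returns false for a tag that is already unresolved,
        // so a second instance failing on the same tag does not count again.
        if (_unresolvedTags.Add(tag)) TotalSubscribed++;
    }
}
```

Using the boolean results of `Remove` and `Add` as the branch condition keeps the set and the counters in step without a separate lookup.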
@@ -1115,9 +1149,22 @@ test that asserts `_totalSubscribed` / `_resolvedTags` consistency after the
 |--|--|
 | Severity | Medium |
 | Category | Correctness & logic bugs |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:626-634,642-687` |

+**Resolution** — `HandleSubscribeCompleted` now detects the
+mid-termination race: when `_subscriptionsByInstance.TryGetValue` fails,
+the method logs a warning, clears each owned tag from
+`_subscribesInFlight`, fires fire-and-forget
+`_adapter.UnsubscribeAsync(result.SubscriptionId!)` for every successful
+non-AlreadySubscribed result (so the OPC UA monitored items don't keep
+streaming for a tag nobody listens to), and returns without applying
+counter or handle mutations. Regression test
+`DCL021_UnsubscribeDuringInFlightSubscribe_ReleasesAdapterHandle_AndKeepsStateClean`
+parks the subscribe, sends `UnsubscribeTagsRequest`, then releases the
+subscribe and asserts `_adapter.UnsubscribeAsync` is called and
+`TotalSubscribedTags == 0`.
+
 **Description**

 `HandleSubscribe` dispatches a `Task.Run` that performs adapter I/O off the actor
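
Reviewer note on DCL-021: the orphaned-completion path in the resolution above can be sketched as below. Everything except the field names `_subscriptionsByInstance`, `_subscribesInFlight`, and `_adapter` is hypothetical, including the `IAdapter` interface, the `SubscribeResult` record, and the `RecordingAdapter` fake; logging and counter updates are omitted.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Threading.Tasks;

public interface IAdapter { Task UnsubscribeAsync(string subscriptionId); }

public sealed record SubscribeResult(string Tag, string? SubscriptionId, bool AlreadySubscribed);

// Fake adapter that records which handles were released.
public sealed class RecordingAdapter : IAdapter
{
    public List<string> Released { get; } = new();
    public Task UnsubscribeAsync(string subscriptionId)
    {
        Released.Add(subscriptionId);
        return Task.CompletedTask;
    }
}

public sealed class CompletionHandler
{
    private readonly Dictionary<string, HashSet<string>> _subscriptionsByInstance = new();
    private readonly HashSet<string> _subscribesInFlight = new();
    private readonly IAdapter _adapter;

    public CompletionHandler(IAdapter adapter) => _adapter = adapter;

    public void BeginSubscribe(string instance, IEnumerable<string> tags)
    {
        _subscriptionsByInstance[instance] = new HashSet<string>();
        foreach (var tag in tags) _subscribesInFlight.Add(tag);
    }

    public void Unsubscribe(string instance) => _subscriptionsByInstance.Remove(instance);

    public int InFlightCount => _subscribesInFlight.Count;

    // Returns false when the instance unsubscribed while the I/O was in flight.
    public bool OnSubscribeCompleted(string instance, IReadOnlyList<SubscribeResult> results)
    {
        var known = _subscriptionsByInstance.TryGetValue(instance, out var tags);
        foreach (var r in results)
        {
            if (r.AlreadySubscribed) continue;   // owned by another request
            _subscribesInFlight.Remove(r.Tag);
            if (r.SubscriptionId is null) continue;
            if (known) tags!.Add(r.Tag);
            // Orphaned handle: release it without awaiting (fire-and-forget).
            else _ = _adapter.UnsubscribeAsync(r.SubscriptionId);
        }
        return known;
    }
}
```

Discarding the task means a failed `UnsubscribeAsync` goes unobserved in this sketch; production code would normally attach a continuation that logs the fault.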
@@ -1170,9 +1217,19 @@ for A while the subscribe I/O is in flight, completes the subscribe, and asserts
 |--|--|
 | Severity | Medium |
 | Category | Akka.NET conventions |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:691-698,991-998` |

+**Resolution** — both call sites now gate
+`Timers.StartPeriodicTimer("tag-resolution-retry", ...)` with
+`!Timers.IsTimerActive("tag-resolution-retry")` so the first failure
+arms the timer and subsequent failures only pile onto `_unresolvedTags`.
+Regression test
+`DCL022_BurstedFailedSubscribes_DoNotResetRetryTimer` fires 5 failed
+subscribes within one retry interval and asserts the retry timer
+actually fires within one interval of the first failure (pre-fix it
+would have been pushed past the interval boundary by the resets).
+
 **Description**

 `HandleSubscribeCompleted` (line 691-698) and `HandleTagResolutionFailed` (line
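
Reviewer note on DCL-022: the effect of the `IsTimerActive` gate can be illustrated without Akka.NET using a small fake scheduler. `FakeTimers` and `RetryTimer` are hypothetical; the fake only models the one behaviour that matters here, namely that starting a timer under an existing key replaces the previous one and so pushes the first tick back. Its method signatures are simplified and do not match the real Akka.NET timer scheduler.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical scheduler: tracks when each keyed timer will first fire.
public sealed class FakeTimers
{
    private readonly Dictionary<string, TimeSpan> _firstDue = new();

    public bool IsTimerActive(string key) => _firstDue.ContainsKey(key);

    // Re-using a key replaces the earlier timer, resetting its first tick.
    public void StartPeriodicTimer(string key, TimeSpan now, TimeSpan interval) =>
        _firstDue[key] = now + interval;

    public TimeSpan FirstDue(string key) => _firstDue[key];
}

public static class RetryTimer
{
    public const string Key = "tag-resolution-retry";

    // Pre-fix shape: every failure restarts the timer.
    public static void ArmUngated(FakeTimers timers, TimeSpan now, TimeSpan interval) =>
        timers.StartPeriodicTimer(Key, now, interval);

    // Post-fix shape: only the first failure arms the timer.
    public static void ArmGated(FakeTimers timers, TimeSpan now, TimeSpan interval)
    {
        if (!timers.IsTimerActive(Key))
            timers.StartPeriodicTimer(Key, now, interval);
    }
}
```

With five failures spaced two seconds apart and a ten-second interval, the ungated variant ends up with its first tick at 18 s, while the gated variant keeps it at 10 s, one interval after the first failure.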