fix(concurrency): close 8 race / thread-safety findings across CD, DCL, SR
CD-015: rewrite NotificationOutboxRepository.InsertIfNotExistsAsync as raw-SQL IF NOT EXISTS … INSERT with SqlException 2601/2627 catch, ending the at-least-once livelock on the site→central notification handoff. DCL-018/019/020/021/022: add _subscribesInFlight guard so concurrent same-tag subscribes don't orphan an adapter handle; delete the latent dead _subscriptionHandles dictionary; stop double-counting _totalSubscribed when an unresolved tag is promoted via another instance; release adapter handles on mid-flight unsubscribe; gate the tag-resolution retry timer with IsTimerActive so subscribe bursts don't reset it into starvation. SR-020: add _terminatingActorsByName shadow so a third deploy arriving during a pending redeploy doesn't crash on InvalidActorNameException — displaced senders get a Failed/superseded response and the latest command wins on Terminated. SR-024: split OperationTrackingStore reads from writes (fresh SqliteConnection per GetStatusAsync) so long writes don't block status queries; rewrite Dispose to drop the sync-over-async bridge that could deadlock on a non-reentrant SyncContext; Interlocked.Exchange makes the dispose-once flag race-safe across both paths.
This commit is contained in:
@@ -891,9 +891,22 @@ columns) — asserting each column keeps an `EncryptedStringConverter`.
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/NotificationOutboxRepository.cs:33-45` |
|
||||
|
||||
**Resolution** — rewrote `InsertIfNotExistsAsync` as a single raw-SQL
|
||||
`IF NOT EXISTS (...) INSERT` matching the
|
||||
`AuditLogRepository.InsertIfNotExistsAsync` and
|
||||
`SiteCallAuditRepository.UpsertAsync` patterns, with a `SqlException`
|
||||
catch on numbers 2601 (unique-index violation) and 2627
|
||||
(primary-key/unique-constraint violation) returning `false`. Concurrent
|
||||
losers are logged at Debug and treated as no-ops, eliminating the
|
||||
site-retry livelock. Two SQLite-targeted assertions in
|
||||
`RepositoryCoverageTests` were migrated to a new MS SQL-fixture file
|
||||
`tests/ScadaLink.ConfigurationDatabase.Tests/Repositories/NotificationOutboxRepositoryIntegrationTests.cs`,
|
||||
which also adds a 50-way parallel race test verifying exactly one row
|
||||
lands and no exception bubbles.
|
||||
|
||||
**Description**
|
||||
|
||||
`InsertIfNotExistsAsync` does `AnyAsync(x => x.NotificationId == n.NotificationId)`,
|
||||
|
||||
@@ -952,9 +952,22 @@ pre-fix code (the batch throws, no map returned) and passes after;
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:557,564-594,653` |
|
||||
|
||||
**Resolution** — added a `_subscribesInFlight` HashSet mirroring the
|
||||
existing `_resolutionInFlight` pattern. `HandleSubscribe` now partitions
|
||||
each request's tags on the actor thread into "this request will
|
||||
SubscribeAsync" vs. "already subscribed by us OR by another in-flight
|
||||
request"; the second arrival sees the tag in `_subscribesInFlight` and
|
||||
treats it as `AlreadySubscribed: true` without issuing a duplicate
|
||||
adapter call. `HandleSubscribeCompleted` removes each
|
||||
non-AlreadySubscribed result from the set; `ReSubscribeAll` clears it on
|
||||
reconnect. Regression test
|
||||
`DCL018_ConcurrentSubscribes_SameTag_DifferentInstances_IssueOneAdapterSubscribe`
|
||||
parks the first subscribe in flight, fires a second for the same tag,
|
||||
and asserts exactly one adapter SubscribeAsync call.
|
||||
|
||||
**Description**
|
||||
|
||||
`HandleSubscribe` snapshots `_subscriptionIds.Keys` into a local `alreadySubscribed`
|
||||
@@ -1004,9 +1017,19 @@ exactly one `_adapter.SubscribeAsync(tag, ...)` call (and no orphan subscription
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:31,167,177`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:163-164` |
|
||||
|
||||
**Resolution** — deleted the dead `_subscriptionHandles` field outright.
|
||||
Subscription bookkeeping lives in `RealOpcUaClient._monitoredItems` /
|
||||
`_callbacks` (already `ConcurrentDictionary` per DCL-003) at the device
|
||||
layer and `DataConnectionActor._subscriptionIds` at the actor layer;
|
||||
the adapter had no live reader and the field was a latent
|
||||
race-condition trap. Added structural regression test
|
||||
`DCL019_OpcUaDataConnection_HasNoNonConcurrentSharedDictionary` that
|
||||
reflects over the adapter's fields and fails if any plain
|
||||
`Dictionary<,>` is reintroduced.
|
||||
|
||||
**Description**
|
||||
|
||||
`OpcUaDataConnection._subscriptionHandles` is declared as `Dictionary<string,
|
||||
@@ -1061,9 +1084,20 @@ state.
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:653-661,670-688` |
|
||||
|
||||
**Resolution** — split the success branch into "fresh subscribe" vs
|
||||
"unresolved → resolved promotion": `_unresolvedTags.Remove(...)` is now
|
||||
called before incrementing the counters; a promotion increments only
|
||||
`_resolvedTags` and clears `_resolutionInFlight`, mirroring
|
||||
`HandleTagResolutionSucceeded`. The symmetric failure branch was also
|
||||
fixed — `_totalSubscribed` only increments when
|
||||
`_unresolvedTags.Add(...)` returns `true` so a second instance failing
|
||||
to resolve the same tag is a no-op on the counter. Tests:
|
||||
`DCL020_UnresolvedTagPromoted_ByDifferentInstance_DoesNotDoubleCountTotalSubscribed`
|
||||
and `DCL020_TwoInstancesFailingSameTag_OnlyCountsTagOnceInTotal`.
|
||||
|
||||
**Description**
|
||||
|
||||
`HandleSubscribeCompleted`'s success branch (line 656-661) writes
|
||||
@@ -1115,9 +1149,22 @@ test that asserts `_totalSubscribed` / `_resolvedTags` consistency after the
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:626-634,642-687` |
|
||||
|
||||
**Resolution** — `HandleSubscribeCompleted` now detects the
|
||||
mid-termination race: when `_subscriptionsByInstance.TryGetValue` fails,
|
||||
the method logs a warning, clears each owned tag from
|
||||
`_subscribesInFlight`, fires fire-and-forget
|
||||
`_adapter.UnsubscribeAsync(result.SubscriptionId!)` for every successful
|
||||
non-AlreadySubscribed result (so the OPC UA monitored items don't keep
|
||||
streaming for a tag nobody listens to), and returns without applying
|
||||
counter or handle mutations. Regression test
|
||||
`DCL021_UnsubscribeDuringInFlightSubscribe_ReleasesAdapterHandle_AndKeepsStateClean`
|
||||
parks the subscribe, sends `UnsubscribeTagsRequest`, then releases the
|
||||
subscribe and asserts `_adapter.UnsubscribeAsync` is called and
|
||||
`TotalSubscribedTags == 0`.
|
||||
|
||||
**Description**
|
||||
|
||||
`HandleSubscribe` dispatches a `Task.Run` that performs adapter I/O off the actor
|
||||
@@ -1170,9 +1217,19 @@ for A while the subscribe I/O is in flight, completes the subscribe, and asserts
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Akka.NET conventions |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:691-698,991-998` |
|
||||
|
||||
**Resolution** — both call sites now gate
|
||||
`Timers.StartPeriodicTimer("tag-resolution-retry", ...)` with
|
||||
`!Timers.IsTimerActive("tag-resolution-retry")` so the first failure
|
||||
arms the timer and subsequent failures only pile onto `_unresolvedTags`.
|
||||
Regression test
|
||||
`DCL022_BurstedFailedSubscribes_DoNotResetRetryTimer` fires 5 failed
|
||||
subscribes within one retry interval and asserts the retry timer
|
||||
actually fires within one interval of the first failure (pre-fix it
|
||||
would have been pushed past the interval boundary by the resets).
|
||||
|
||||
**Description**
|
||||
|
||||
`HandleSubscribeCompleted` (line 691-698) and `HandleTagResolutionFailed` (line
|
||||
|
||||
+8
-16
@@ -40,10 +40,10 @@ module file and counted in **Total**.
|
||||
| Severity | Open findings |
|
||||
|----------|---------------|
|
||||
| Critical | 0 |
|
||||
| High | 12 |
|
||||
| Medium | 52 |
|
||||
| High | 10 |
|
||||
| Medium | 46 |
|
||||
| Low | 90 |
|
||||
| **Total** | **154** |
|
||||
| **Total** | **146** |
|
||||
|
||||
## Module Status
|
||||
|
||||
@@ -55,8 +55,8 @@ module file and counted in **Total**.
|
||||
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/4 | 4 | 14 |
|
||||
| [Commons](Commons/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/6 | 9 | 23 |
|
||||
| [Communication](Communication/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/1/5 | 7 | 22 |
|
||||
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/4/5 | 10 | 24 |
|
||||
| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/4/0 | 5 | 22 |
|
||||
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/4/5 | 9 | 24 |
|
||||
| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 |
|
||||
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/1/5 | 7 | 24 |
|
||||
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/2/3 | 6 | 23 |
|
||||
| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/5 | 7 | 23 |
|
||||
@@ -68,7 +68,7 @@ module file and counted in **Total**.
|
||||
| [Security](Security/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/2 | 2 | 21 |
|
||||
| [SiteCallAudit](SiteCallAudit/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/4 | 6 | 6 |
|
||||
| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/2/6 | 9 | 23 |
|
||||
| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/4/3 | 7 | 26 |
|
||||
| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/3 | 5 | 26 |
|
||||
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/3/3 | 7 | 24 |
|
||||
| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-28 | `1eb6e97` | 0/1/4/1 | 6 | 22 |
|
||||
| [Transport](Transport/findings.md) | 2026-05-28 | `1eb6e97` | 0/2/2/4 | 8 | 12 |
|
||||
@@ -84,13 +84,11 @@ description, location, recommendation — lives in the module's `findings.md`.
|
||||
|
||||
_None open._
|
||||
|
||||
### High (12)
|
||||
### High (10)
|
||||
|
||||
| ID | Module | Title |
|
||||
|----|--------|-------|
|
||||
| Communication-016 | [Communication](Communication/findings.md) | `HandleConnectionStateChanged` is dead code — the documented disconnect-cleanup workflow never fires |
|
||||
| ConfigurationDatabase-015 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `NotificationOutboxRepository.InsertIfNotExistsAsync` is a check-then-act race with no duplicate-key catch |
|
||||
| DataConnectionLayer-018 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Concurrent subscribes for the same tag from different instances orphan an adapter subscription handle |
|
||||
| DeploymentManager-018 | [DeploymentManager](DeploymentManager/findings.md) | Reconciliation force-sets `Enabled`, overwriting an intentional `Disabled` after central failover |
|
||||
| ExternalSystemGateway-018 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `DeliverBufferedAsync` lets `JsonException` propagate, turning a corrupt buffered row into a permanent retry-forever poison message |
|
||||
| InboundAPI-022 | [InboundAPI](InboundAPI/findings.md) | `IActiveNodeGate` has no production registration in Host — standby-node gating is silently disabled in production |
|
||||
@@ -101,7 +99,7 @@ _None open._
|
||||
| Transport-001 | [Transport](Transport/findings.md) | Template Overwrite never syncs attributes / alarms / scripts |
|
||||
| Transport-002 | [Transport](Transport/findings.md) | ExternalSystem Overwrite never syncs methods |
|
||||
|
||||
### Medium (52)
|
||||
### Medium (46)
|
||||
|
||||
| ID | Module | Title |
|
||||
|----|--------|-------|
|
||||
@@ -120,10 +118,6 @@ _None open._
|
||||
| ConfigurationDatabase-017 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Stub-attach delete on `DeploymentRecord` bypasses optimistic concurrency |
|
||||
| ConfigurationDatabase-018 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `DateTime`-typed `*Utc` columns on `AuditEvent` / `SiteCall` carry no `DateTimeKind` enforcement |
|
||||
| ConfigurationDatabase-019 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `EnsureLookaheadAsync` swallows non-idempotent SPLIT failures and continues, creating partition holes |
|
||||
| DataConnectionLayer-019 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from concurrent thread-pool continuations |
|
||||
| DataConnectionLayer-020 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleSubscribeCompleted` double-counts `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance's subscribe |
|
||||
| DataConnectionLayer-021 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleSubscribeCompleted` re-creates and leaks `_subscriptionsByInstance` entry when the instance unsubscribed mid-flight |
|
||||
| DataConnectionLayer-022 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleSubscribeCompleted` and `HandleTagResolutionFailed` reset the tag-resolution retry timer on every call via `StartPeriodicTimer`, starving the retry under subscribe bursts |
|
||||
| DeploymentManager-019 | [DeploymentManager](DeploymentManager/findings.md) | Lifecycle command timeout writes no audit entry |
|
||||
| ExternalSystemGateway-019 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `HttpClient.Timeout` is not set; `DefaultHttpTimeout` > 100s is silently clipped by the framework default |
|
||||
| ExternalSystemGateway-020 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `JsonElementToParameterValue` silently downcasts non-Int64 JSON numbers to `double`, losing precision for `decimal` SQL parameters on retry |
|
||||
@@ -144,10 +138,8 @@ _None open._
|
||||
| SiteCallAudit-003 | [SiteCallAudit](SiteCallAudit/findings.md) | `OnUpsertAsync` does not refresh `IngestedAtUtc`; direct-write callers must remember to stamp it |
|
||||
| SiteEventLogging-015 | [SiteEventLogging](SiteEventLogging/findings.md) | Background write queue is unbounded; can grow without limit under sustained writer slowness |
|
||||
| SiteEventLogging-017 | [SiteEventLogging](SiteEventLogging/findings.md) | Central client's `PageSize` is unbounded; defeats the "configurable page size" design rationale |
|
||||
| SiteRuntime-020 | [SiteRuntime](SiteRuntime/findings.md) | Second `DeployInstanceCommand` arriving during a pending redeploy races the still-terminating actor on its name |
|
||||
| SiteRuntime-021 | [SiteRuntime](SiteRuntime/findings.md) | `HandleDeployArtifacts` updates `DataConnections` in SQLite but never sends `CreateConnectionCommand` to the DCL |
|
||||
| SiteRuntime-022 | [SiteRuntime](SiteRuntime/findings.md) | `AuditingDbCommand.DbConnection.set` uses reflection to read `AuditingDbConnection._inner` |
|
||||
| SiteRuntime-024 | [SiteRuntime](SiteRuntime/findings.md) | `OperationTrackingStore` serialises all writes through one connection + `SemaphoreSlim`, and `Dispose()` does sync-over-async |
|
||||
| StoreAndForward-019 | [StoreAndForward](StoreAndForward/findings.md) | Notifications park after `DefaultMaxRetries` exhaustion, contradicting "retried until central acks" |
|
||||
| StoreAndForward-020 | [StoreAndForward](StoreAndForward/findings.md) | `RetryParkedMessageAsync` skips standby replication when the message is deleted between local update and re-load |
|
||||
| StoreAndForward-021 | [StoreAndForward](StoreAndForward/findings.md) | Design doc claims the Operation Tracking Table lives in StoreAndForward but the implementation is in SiteRuntime |
|
||||
|
||||
@@ -954,9 +954,22 @@ Instance Actor produces no `InstanceLifecycleResponse` for either command
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:285`, `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:971` |
|
||||
|
||||
**Resolution** — added a name → terminating-actor-ref shadow
|
||||
(`_terminatingActorsByName`) populated when `HandleDeploy` stops the
|
||||
predecessor and cleared in `HandleTerminated`. `HandleDeploy` now
|
||||
detects the mid-termination state before falling through to
|
||||
`ApplyDeployment(fresh)`: on hit it tells the displaced
|
||||
`PendingRedeploy.OriginalSender` a `DeploymentStatus.Failed` /
|
||||
"superseded by newer deployment …" response and overwrites the buffered
|
||||
pending command (last-write-wins). Regression test
|
||||
`SR020_ThreeRapidDeploys_DoNotThrowInvalidActorNameException_LatestWins`
|
||||
fires three rapid deploys, asserts the middle deploy is told it was
|
||||
superseded, the latest succeeds, and the resulting instance is operable
|
||||
(DisableInstanceCommand works).
|
||||
|
||||
**Description**
|
||||
|
||||
The SiteRuntime-003 fix makes `HandleDeploy` watch + stop a running Instance
|
||||
@@ -1181,9 +1194,24 @@ on the host's regional settings.
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Performance & resource management |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:39`, `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:360` |
|
||||
|
||||
**Resolution** — split reads from writes: the single owned
|
||||
`_writeConnection` + `_writeGate` still serialises writers, but
|
||||
`GetStatusAsync` now opens a fresh `SqliteConnection` per call against
|
||||
the shared connection string (mirroring `SiteStorageService`) so reads
|
||||
never block on an in-flight write. Sync `Dispose` was rewritten to NOT
|
||||
bridge to async — the dispose-once flag is an `int` flipped with
|
||||
`Interlocked.Exchange`, the synchronous path disposes
|
||||
`_writeConnection` + `_writeGate` directly without acquiring the gate,
|
||||
and `DisposeAsync` retains the gate-drain semantics for graceful
|
||||
shutdown. Both paths are idempotent; the second call short-circuits via
|
||||
the interlocked flag. Tests:
|
||||
`SR024_ConcurrentReads_DoNotBlockOnInFlightWrite`,
|
||||
`SR024_SyncDispose_DoesNotDeadlock_WhenInvokedFromFreshThread`, and
|
||||
`SR024_AsyncDispose_DoesNotDeadlock_AndIsIdempotent`.
|
||||
|
||||
**Description**
|
||||
|
||||
`OperationTrackingStore` owns exactly one `SqliteConnection` and gates every
|
||||
|
||||
Reference in New Issue
Block a user