CD-015: rewrite NotificationOutboxRepository.InsertIfNotExistsAsync as raw-SQL
IF NOT EXISTS … INSERT with SqlException 2601/2627 catch, ending the
at-least-once livelock on the site→central notification handoff.
DCL-018/019/020/021/022: add _subscribesInFlight guard so concurrent
same-tag subscribes don't orphan an adapter handle; delete the latent
dead _subscriptionHandles dictionary; stop double-counting
_totalSubscribed when an unresolved tag is promoted via another instance;
release adapter handles on mid-flight unsubscribe; gate the
tag-resolution retry timer with IsTimerActive so subscribe bursts don't
reset it into starvation.
SR-020: add _terminatingActorsByName shadow so a third deploy arriving
during a pending redeploy doesn't crash on InvalidActorNameException —
displaced senders get a Failed/superseded response and the latest
command wins on Terminated.
SR-024: split OperationTrackingStore reads from writes (fresh
SqliteConnection per GetStatusAsync) so long writes don't block status
queries; rewrite Dispose to drop the sync-over-async bridge that could
deadlock on a non-reentrant SyncContext; Interlocked.Exchange makes the
dispose-once flag race-safe across both paths.
HandleSubscribe spawned a Task.Run that mutated DataConnectionActor private
state (_subscriptionIds, _subscriptionsByInstance, _totalSubscribed,
_resolvedTags, _unresolvedTags) from a thread-pool thread, racing the actor's
own message loop — a data race on non-thread-safe Dictionary/HashSet and
non-atomic counters.
Restructured HandleSubscribe to follow the actor's existing PipeTo(Self)
pattern: the background task now performs only adapter I/O and pipes a
SubscribeCompleted message to Self; all subscription-state mutation happens
in the new HandleSubscribeCompleted handler on the actor thread (wired into
the Connected, Connecting and Reconnecting states).
Adds DCL001_ConcurrentSubscribes_DoNotCorruptSubscriptionCounters (30x30
concurrent subscribes) which fails against the pre-fix code and passes after.
Previously, failover only triggered when ConnectAsync failed consecutively.
If a connection succeeded but went stale quickly (e.g., heartbeat timeout),
the failure counter reset on each successful connect and failover never
triggered.
Added a separate _consecutiveUnstableDisconnects counter that increments
when a connection lasts less than StableConnectionThreshold (60s) before
disconnecting. When this counter reaches failoverRetryCount, the actor
fails over to the backup endpoint. Stable connections (lasting >60s)
reset this counter.
The original connection-failure failover path is unchanged.
DataConnectionActor now calls UpdateConnectionHealth() on state
transitions (Connecting/Connected/Reconnecting) and UpdateTagResolution()
on connection establishment. DataConnectionManagerActor calls
RemoveConnection() on actor removal. Health reports now include
data connection statuses when instances are deployed with bindings.