Files

Joseph Doherty 239bee3bc4 fix(data-connection): resolve DataConnectionLayer-001 — off-thread actor state mutation

HandleSubscribe spawned a Task.Run that mutated DataConnectionActor private
state (_subscriptionIds, _subscriptionsByInstance, _totalSubscribed,
_resolvedTags, _unresolvedTags) from a thread-pool thread, racing the actor's
own message loop — a data race on non-thread-safe Dictionary/HashSet and
non-atomic counters.

Restructured HandleSubscribe to follow the actor's existing PipeTo(Self)
pattern: the background task now performs only adapter I/O and pipes a
SubscribeCompleted message to Self; all subscription-state mutation happens
in the new HandleSubscribeCompleted handler on the actor thread (wired into
the Connected, Connecting and Reconnecting states).

Adds DCL001_ConcurrentSubscribes_DoNotCorruptSubscriptionCounters (30x30
concurrent subscribes) which fails against the pre-fix code and passes after.

2026-05-16 18:26:43 -04:00

23 KiB

Raw Blame History

Code Review — DataConnectionLayer

Field	Value
Module	`src/ScadaLink.DataConnectionLayer`
Design doc	`docs/requirements/Component-DataConnectionLayer.md`
Status	Reviewed
Last reviewed	2026-05-16
Reviewer	claude-agent
Commit reviewed	`9c60592`
Open findings	12

Summary

The DataConnectionLayer is a reasonably well-structured module: the Become/Stash lifecycle state machine, the captured-Self marshalling of background-thread disconnect events, and the protocol-factory abstraction all follow the design doc and Akka.NET conventions. However, the review found one critical actor-model violation — HandleSubscribe spawns a Task.Run that mutates the actor's private dictionaries and counters from a thread-pool thread, racing with the actor's own message loop. Several high-severity issues cluster around concurrency and error handling: the subscription-failure path leaves the connection with degraded subtrees but no real recovery, the DataConnectionManagerActor's Restart supervision drops all subscription state on a connection-actor crash, and RealOpcUaClient's monitored- item callback dictionary is mutated without synchronization while OPC UA notification threads read it. The remaining findings concern stale health counters after failover, an unused WriteTimeout option (writes are unbounded despite the design promising a 30 s timeout), ReadBatchAsync aborting mid-batch, and documentation drift between the design doc's failover state machine and the implemented unstable-disconnect heuristic. Test coverage is adequate for the happy paths and failover but absent for tag-resolution retry, disconnect/re-subscribe, and concurrency around HandleSubscribe.

Checklist coverage

#	Category	Examined	Notes
1	Correctness & logic bugs	x	`_resolvedTags` double-counting and stale counters after failover; `ReadBatchAsync` aborts mid-batch.
2	Akka.NET conventions	x	`Task.Run` mutating actor state (critical); `Restart` supervision loses state; closures capturing `_subscriptionsByInstance`.
3	Concurrency & thread safety	x	Actor state mutated off the actor thread; `RealOpcUaClient` callback dictionary unsynchronized.
4	Error handling & resilience	x	Subscription failures not surfaced; unbounded write with no timeout; reconnect after subscribe-time failure not handled.
5	Security	x	`AutoAcceptUntrustedCerts` defaults to `true`; OPC UA password handling acceptable. See finding 012.
6	Performance & resource management	x	`HandleUnsubscribe` O(n^2) over instances; initial-read loop serial per tag.
7	Design-document adherence	x	Failover heuristic (unstable-disconnect count) differs from documented state machine; `WriteTimeout` documented but unused.
8	Code organization & conventions	x	No issues found — POCOs in Commons, options class owned by component, factory pattern consistent.
9	Testing coverage	x	No tests for tag-resolution retry, disconnect/re-subscribe, bad-quality push, or `HandleSubscribe` concurrency.
10	Documentation & comments	x	XML comment on `RaiseDisconnected` claims thread safety it does not have; design doc round-robin description stale.

Findings

DataConnectionLayer-001 — `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread


Severity	Critical
Category	Concurrency & thread safety
Status	Resolved
Location	`src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:473-538`

Description

HandleSubscribe launches a Task.Run(async () => ...) that runs on a thread-pool thread and directly mutates the actor's private mutable state: instanceTags (a reference into _subscriptionsByInstance), _subscriptionIds, _totalSubscribed, _resolvedTags, and _unresolvedTags. All of these are simultaneously read and written by the actor's own message loop (HandleTagValueReceived, HandleUnsubscribe, ReSubscribeAll, HandleRetryTagResolution, ReplyWithHealthReport). This is a direct violation of the Akka.NET actor model, which guarantees single-threaded access to actor state only when state is touched on the actor thread. Two concurrent subscribe requests, or a subscribe overlapping a TagValueReceived / GetHealthReport, produce data races on Dictionary/HashSet/int — Dictionary is not thread-safe and concurrent mutation can corrupt internal buckets, throw, or lose entries. It can also produce torn reads of the health counters.

Recommendation

Do not mutate actor state from the background task. Perform only the await _adapter.SubscribeAsync(...) / ReadAsync(...) I/O in the task, collect the results into a local immutable result object, and PipeTo(Self) an internal message (e.g. SubscribeCompleted) whose handler — running on the actor thread — applies all state mutations and counter updates. The response to Sender should be sent from that handler too.

Resolution

Resolved 2026-05-16. HandleSubscribe was restructured to follow the actor's own PipeTo(Self) pattern (the one already used by HandleRetryTagResolution): the background Task.Run now performs only adapter I/O (SubscribeAsync/ReadAsync), collects per-tag outcomes into an immutable SubscribeCompleted message, and pipes that to Self. All mutation of _subscriptionIds, _subscriptionsByInstance, _totalSubscribed, _resolvedTags and _unresolvedTags now happens in the new HandleSubscribeCompleted handler on the actor thread; it is wired into the Connected, Connecting and Reconnecting states so an in-flight subscribe is applied regardless of state transitions. Regression test DCL001_ConcurrentSubscribes_DoNotCorruptSubscriptionCounters (30×30 concurrent subscribes) fails against the pre-fix code and passes after. Fixed by the commit whose message references DataConnectionLayer-001.

DataConnectionLayer-002 — `Restart` supervision discards all subscription state on connection-actor crash


Severity	High
Category	Akka.NET conventions
Status	Open
Location	`src/ScadaLink.DataConnectionLayer/Actors/DataConnectionManagerActor.cs:131-141`

Description

DataConnectionManagerActor.SupervisorStrategy returns a OneForOneStrategy with Directive.Restart for DataConnectionActor failures. On restart, Akka.NET creates a fresh actor instance, so all in-memory fields — _subscriptionsByInstance, _subscriptionIds, _subscribers, _unresolvedTags, the quality counters — are silently discarded. The actor re-enters Connecting with zero subscriptions, and the design doc's "transparent re-subscribe" guarantee (WP-10) is broken: Instance Actors that had subscribed before the crash never get their tags re-subscribed and will sit at uncertain/stale quality indefinitely with no error returned. There is no durable subscription store from which a restarted actor could rebuild state.

Recommendation

Either (a) make the subscription registry durable/recoverable so a restarted actor can rebuild it (persist to local SQLite as the design doc says connection definitions are, and have PreStart reload subscriptions), or (b) treat a connection-actor crash as a lifecycle event the DataConnectionManagerActor notices, so it can re-issue the subscription registrations. At minimum document that subscribers must re-register after a crash and surface the lost-state condition rather than failing silently.

Resolution

Unresolved.

DataConnectionLayer-003 — `RealOpcUaClient` callback/monitored-item dictionaries mutated without synchronization


Severity	High
Category	Concurrency & thread safety
Status	Open
Location	`src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:16-17,130-131,153,163,173,183-184`

Description

_monitoredItems and _callbacks are plain Dictionary<,> instances. They are written from CreateSubscriptionAsync / RemoveSubscriptionAsync (invoked from the DataConnectionActor's Task.Run / ContinueWith continuations, i.e. thread-pool threads) and from DisconnectAsync (.Clear()), while being read concurrently from the OPC Foundation SDK's MonitoredItem.Notification event handler, which fires on the SDK's internal publish threads (_callbacks.TryGetValue(handle, ...) at line 163). Concurrent reads during a Dictionary resize or Clear() are undefined behaviour — they can throw InvalidOperationException, return wrong entries, or corrupt the dictionary. The DataConnectionActor's subscribe path already runs off the actor thread (finding 001), so multiple subscribe calls can also race each other here.

Recommendation

Use ConcurrentDictionary<,> for _monitoredItems and _callbacks, or guard all access with a lock. Note that fixing finding 001 (serialising subscribe through the actor thread) reduces but does not eliminate the race, because the SDK notification threads still read _callbacks concurrently with RemoveSubscriptionAsync / DisconnectAsync.

Resolution

Unresolved.


Severity	High
Category	Error handling & resilience
Status	Open
Location	`src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:495-503,529-537`

Description

When _adapter.SubscribeAsync throws inside the HandleSubscribe background task, the catch block adds the tag to _unresolvedTags and increments _totalSubscribed, treating every subscribe exception as a tag-resolution failure. But SubscribeAsync also throws InvalidOperationException from EnsureConnected() when the OPC UA client is not connected, and throws on transport faults — these are connection problems, not bad tag paths. They get misclassified as unresolved tags and retried on the 10 s tag-resolution timer instead of triggering the reconnection state machine. Worse, the design doc (Tag Path Resolution, step 2) says the failed tag's attribute must be marked quality bad; the code never pushes a bad-quality update to the subscriber for a tag that fails to resolve at subscribe time, so the Instance Actor stays at uncertain quality with no signal. The TagResolutionFailed message it sends to Self only logs and re-arms the timer (HandleTagResolutionFailed).

Recommendation

Distinguish connection-level exceptions (raise AdapterDisconnected / let the reconnect machine handle them) from genuine node-not-found errors. For genuine resolution failures, push a TagValueUpdate with QualityCode.Bad to the subscribing Instance Actor so it reflects the documented behaviour.

Resolution

Unresolved.

DataConnectionLayer-005 — `WriteTimeout` option is documented and configured but never applied


Severity	High
Category	Design-document adherence
Status	Open
Location	`src/ScadaLink.DataConnectionLayer/DataConnectionOptions.cs:15`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:573-590`

Description

DataConnectionOptions.WriteTimeout (default 30 s) and the design doc's "Shared Settings" table both promise a bounded timeout for synchronous device writes. The value is never read anywhere in the module (grep confirms only the declaration). HandleWrite calls _adapter.WriteAsync(request.TagPath, request.Value) with no CancellationToken and no timeout. If the OPC UA server hangs (TCP black-hole, no RST), the write Task never completes, PipeTo(sender) never fires, and the calling script's Ask blocks until its own ask-timeout — and the script gets no DCL-level error. The design states write failures (including timeout) must be returned synchronously to the script; an unbounded write violates that.

Recommendation

Create a CancellationTokenSource(_options.WriteTimeout), pass its token to WriteAsync, and in the continuation translate cancellation into a failed WriteTagResponse with a timeout error message. Apply the same to the read used by the initial-value seed and to WriteBatchAndWaitAsync paths if they are reachable.

Resolution

Unresolved.


Severity	Medium
Category	Correctness & logic bugs
Status	Open
Location	`src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:645-673,721-756`

Description

ReSubscribeAll resets _subscriptionIds, _unresolvedTags and _resolvedTags to a clean slate, but leaves _lastTagQuality, _tagsGoodQuality, _tagsBadQuality and _tagsUncertainQuality untouched. PushBadQualityForAllTags (called on disconnect) sets _tagsBadQuality = _lastTagQuality.Count and zeroes the others. After a reconnect, HandleTagValueReceived decrements the old bucket using _lastTagQuality's value and increments the new one — but tags resolved for the first time after reconnect were never in _lastTagQuality, so they only increment, never decrement, and the totals can drift above _totalSubscribed. Over repeated disconnect/reconnect cycles the health report's good/bad/uncertain counts become unreliable.

Recommendation

On BecomeConnected after a re-subscribe (or in ReSubscribeAll), clear _lastTagQuality and the three quality counters and let them be repopulated from fresh TagValueReceived messages. Alternatively recompute the buckets from _lastTagQuality whenever it changes rather than maintaining incremental counters.

Resolution

Unresolved.

DataConnectionLayer-007 — `ReadBatchAsync` aborts the whole batch on the first failing tag


Severity	Medium
Category	Correctness & logic bugs
Status	Open
Location	`src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:187-195`

Description

ReadBatchAsync loops calling ReadAsync per tag. ReadAsync re-throws any non-cancellation exception (line 184). So if any single tag in the batch throws (bad node, transient fault), the entire ReadBatchAsync throws and the caller gets no results for the tags that did read successfully — even though ReadResult already has a Success/ErrorMessage shape designed to carry per-tag failures. The batch is also fully serial (one round-trip per tag), defeating the point of a batch API; the design doc lists ReadBatch/WriteBatch as first-class operations.

Recommendation

Catch per-tag exceptions inside the loop and store a failed ReadResult for that tag so the batch returns a complete map. Ideally issue a single OPC UA Read service call for all node IDs (RealOpcUaClient.ReadValueAsync already builds a ReadValueIdCollection — extend it to accept multiple nodes).

Resolution

Unresolved.

DataConnectionLayer-008 — `HandleUnsubscribe` is O(n^2) over instances and rechecks `_unresolvedTags` redundantly


Severity	Low
Category	Performance & resource management
Status	Open
Location	`src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:540-569`

Description

For each tag of the instance being removed, HandleUnsubscribe scans every other instance's tag set (_subscriptionsByInstance.Where(...).Any()), making the operation O(tags x instances). On a site with many instances sharing a connection this is needlessly expensive on every instance stop/redeploy. Separately, line 562 re-evaluates !_unresolvedTags.Contains(tagPath) immediately after line 561 already removed tagPath from _unresolvedTags, so the condition is always true — dead logic that obscures intent (the decrement of _resolvedTags is unconditional in practice).

Recommendation

Maintain a reference count per tag path (or a tagPath -> set<instance> reverse index) so the "any other subscriber" check is O(1). Remove the redundant _unresolvedTags re-check or restructure so the resolved/unresolved decrement reflects the tag's actual prior state captured before removal.

Resolution

Unresolved.

DataConnectionLayer-009 — Implemented failover heuristic diverges from the documented state machine


Severity	Medium
Category	Design-document adherence
Status	Open
Location	`src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:189,242-297,379-449`, `docs/requirements/Component-DataConnectionLayer.md:73-85`

Description

The design doc's failover state machine reads "retry active endpoint (5s) -> N failures (>= FailoverRetryCount) -> switch to other endpoint". The code implements two separate failover triggers: (a) HandleReconnectResult counts _consecutiveFailures on connect-attempt failures (matches the doc), and (b) BecomeReconnecting additionally counts _consecutiveUnstableDisconnects — connections that succeeded but dropped within a hard-coded 60 s StableConnectionThreshold — and fails over on that count too. The unstable-disconnect path, the 60 s threshold, and the fact that failover can happen on successful-but-flaky connections are not described in the component doc at all. A reviewer or operator reading Component-DataConnectionLayer.md would not predict this behaviour, and the 60 s threshold is a magic constant not exposed via DataConnectionOptions.

Recommendation

Update Component-DataConnectionLayer.md to document the unstable-disconnect failover path and the stability threshold, and move the 60 s threshold into DataConnectionOptions so it is configurable and consistent with the other tunables.

Resolution

Unresolved.


Severity	Medium
Category	Correctness & logic bugs
Status	Open
Location	`src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:594-619,689-703`

Description

HandleRetryTagResolution fires SubscribeAsync for every tag in _unresolvedTags via ContinueWith(...).PipeTo(self), but does not remove the tags from _unresolvedTags while the attempts are in flight. Because tags are not removed before the retry, a slow SubscribeAsync overlapping the next 10 s tick issues duplicate concurrent subscribe attempts for the same tag, which can create duplicate monitored items / leaked subscription IDs (the second success overwrites _subscriptionIds[tag] in HandleTagResolutionSucceeded, orphaning the first handle with no UnsubscribeAsync call). The timer-cancel condition in HandleTagResolutionSucceeded is also non-deterministic for the same reason.

Recommendation

Remove tags from _unresolvedTags (into an "in-flight" set) when a retry is dispatched, and only put them back on failure. This prevents overlapping duplicate subscribe attempts and makes the timer-cancel condition deterministic.

Resolution

Unresolved.

DataConnectionLayer-011 — Stale subscription callbacks from disposed adapters can still reach the actor


Severity	Medium
Category	Error handling & resilience
Status	Open
Location	`src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:486-489,278-285,416-425`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:252-262`

Description

On failover the actor disposes the old adapter (_adapter.DisposeAsync(), fire-and-forget) and creates a fresh one. The old adapter's subscription callbacks captured self and tagPath and Tell TagValueReceived to the actor. While the Reconnecting handler ignores TagValueReceived (line 334), once the actor reaches Connected again it processes them — and a disposed adapter whose OPC UA SDK threads have not yet fully torn down could still deliver a value, mixing pre-failover device data with the new endpoint's data and briefly reporting a value the active endpoint never produced. There is no per-adapter generation/epoch tag on TagValueReceived to distinguish current from stale callbacks.

Recommendation

Add an adapter-generation counter incremented on every adapter swap; stamp it onto TagValueReceived (captured in the callback closure) and drop messages whose generation does not match the current adapter in HandleTagValueReceived.

Resolution

Unresolved.

DataConnectionLayer-012 — `AutoAcceptUntrustedCerts` defaults to `true`, accepting any server certificate


Severity	Medium
Category	Security
Status	Open
Location	`src/ScadaLink.DataConnectionLayer/Adapters/IOpcUaClient.cs:17`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:49,60-61`, `docs/requirements/Component-DataConnectionLayer.md:116`

Description

OpcUaConnectionOptions.AutoAcceptUntrustedCerts defaults to true, and RealOpcUaClient.ConnectAsync wires CertificateValidator.CertificateValidation += (_, e) => e.Accept = true when it is set. With the default, every server certificate is accepted unconditionally — there is no certificate-pinning or trust-store enforcement — which defeats the Sign/SignAndEncrypt security modes against an active man-in-the-middle on the OPC UA link. The design doc explicitly lists true as the default. For an industrial control link this is a meaningful exposure; a secure-by-default posture would reject untrusted certs unless an operator opts in per connection.

Recommendation

Default AutoAcceptUntrustedCerts to false and require explicit per-connection opt-in, or at minimum log a prominent warning whenever the auto-accept validator is installed. Update the design doc to reflect the secure default.

Resolution

Unresolved.

DataConnectionLayer-013 — Misleading XML comment: `RaiseDisconnected` claims thread safety it does not provide


Severity	Low
Category	Documentation & comments
Status	Open
Location	`src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:270-281`

Description

The XML doc on RaiseDisconnected states "Thread-safe: only the first caller triggers the event." The implementation is a non-atomic check-then-set on a volatile bool (if (_disconnectFired) return; _disconnectFired = true;). volatile guarantees visibility, not atomicity — two threads (e.g. the OPC UA keep-alive thread via OnClientConnectionLost and a ReadAsync failure path) can both observe _disconnectFired == false and both invoke Disconnected. In practice the DataConnectionActor tolerates a duplicate AdapterDisconnected message, so impact is low, but the comment overstates the guarantee. The same pattern exists in RealOpcUaClient.OnSessionKeepAlive (_connectionLostFired).

Recommendation

Either make the guard atomic (Interlocked.Exchange with an int flag, or a lock), or correct the comment to say "best-effort once-only; a duplicate event is possible under a race and is tolerated downstream."

Resolution

Unresolved.

23 KiB Raw Blame History Unescape Escape

Code Review — DataConnectionLayer

Summary

Checklist coverage

Findings

DataConnectionLayer-001 — Task.Run in HandleSubscribe mutates actor state off the actor thread

DataConnectionLayer-002 — Restart supervision discards all subscription state on connection-actor crash

DataConnectionLayer-003 — RealOpcUaClient callback/monitored-item dictionaries mutated without synchronization

DataConnectionLayer-004 — Subscribe-time tag-resolution failure leaves the connection healthy but never recovers correctly

DataConnectionLayer-005 — WriteTimeout option is documented and configured but never applied

DataConnectionLayer-006 — Health quality counters not reset/recomputed after failover or re-subscribe

DataConnectionLayer-007 — ReadBatchAsync aborts the whole batch on the first failing tag

DataConnectionLayer-008 — HandleUnsubscribe is O(n^2) over instances and rechecks _unresolvedTags redundantly

DataConnectionLayer-009 — Implemented failover heuristic diverges from the documented state machine

DataConnectionLayer-010 — Tag-resolution retry can issue duplicate concurrent subscribe attempts

DataConnectionLayer-011 — Stale subscription callbacks from disposed adapters can still reach the actor

DataConnectionLayer-012 — AutoAcceptUntrustedCerts defaults to true, accepting any server certificate

DataConnectionLayer-013 — Misleading XML comment: RaiseDisconnected claims thread safety it does not provide

23 KiB

Raw Blame History

DataConnectionLayer-001 — `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread

DataConnectionLayer-002 — `Restart` supervision discards all subscription state on connection-actor crash

DataConnectionLayer-003 — `RealOpcUaClient` callback/monitored-item dictionaries mutated without synchronization

DataConnectionLayer-005 — `WriteTimeout` option is documented and configured but never applied

DataConnectionLayer-007 — `ReadBatchAsync` aborts the whole batch on the first failing tag

DataConnectionLayer-008 — `HandleUnsubscribe` is O(n^2) over instances and rechecks `_unresolvedTags` redundantly

DataConnectionLayer-012 — `AutoAcceptUntrustedCerts` defaults to `true`, accepting any server certificate

DataConnectionLayer-013 — Misleading XML comment: `RaiseDisconnected` claims thread safety it does not provide