27 KiB
Code Review — DataConnectionLayer
| Field | Value |
|---|---|
| Module | src/ScadaLink.DataConnectionLayer |
| Design doc | docs/requirements/Component-DataConnectionLayer.md |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | 9c60592 |
| Open findings | 8 |
Summary
The DataConnectionLayer is a reasonably well-structured module: the Become/Stash
lifecycle state machine, the captured-Self marshalling of background-thread
disconnect events, and the protocol-factory abstraction all follow the design doc
and Akka.NET conventions. However, the review found one critical actor-model
violation — HandleSubscribe spawns a Task.Run that mutates the actor's private
dictionaries and counters from a thread-pool thread, racing with the actor's own
message loop. Several high-severity issues cluster around concurrency and error
handling: the subscription-failure path leaves the connection with degraded subtrees
but no real recovery, the DataConnectionManagerActor's Restart supervision drops
all subscription state on a connection-actor crash, and RealOpcUaClient's monitored-
item callback dictionary is mutated without synchronization while OPC UA notification
threads read it. The remaining findings concern stale health counters after failover,
an unused WriteTimeout option (writes are unbounded despite the design promising a
30 s timeout), ReadBatchAsync aborting mid-batch, and documentation drift between
the design doc's failover state machine and the implemented unstable-disconnect
heuristic. Test coverage is adequate for the happy paths and failover but absent for
tag-resolution retry, disconnect/re-subscribe, and concurrency around HandleSubscribe.
Checklist coverage
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | x | _resolvedTags double-counting and stale counters after failover; ReadBatchAsync aborts mid-batch. |
| 2 | Akka.NET conventions | x | Task.Run mutating actor state (critical); Restart supervision loses state; closures capturing _subscriptionsByInstance. |
| 3 | Concurrency & thread safety | x | Actor state mutated off the actor thread; RealOpcUaClient callback dictionary unsynchronized. |
| 4 | Error handling & resilience | x | Subscription failures not surfaced; unbounded write with no timeout; reconnect after subscribe-time failure not handled. |
| 5 | Security | x | AutoAcceptUntrustedCerts defaults to true; OPC UA password handling acceptable. See finding 012. |
| 6 | Performance & resource management | x | HandleUnsubscribe O(n^2) over instances; initial-read loop serial per tag. |
| 7 | Design-document adherence | x | Failover heuristic (unstable-disconnect count) differs from documented state machine; WriteTimeout documented but unused. |
| 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. |
| 9 | Testing coverage | x | No tests for tag-resolution retry, disconnect/re-subscribe, bad-quality push, or HandleSubscribe concurrency. |
| 10 | Documentation & comments | x | XML comment on RaiseDisconnected claims thread safety it does not have; design doc round-robin description stale. |
Findings
DataConnectionLayer-001 — Task.Run in HandleSubscribe mutates actor state off the actor thread
| Severity | Critical |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:473-538 |
Description
HandleSubscribe launches a Task.Run(async () => ...) that runs on a thread-pool
thread and directly mutates the actor's private mutable state: instanceTags (a
reference into _subscriptionsByInstance), _subscriptionIds, _totalSubscribed,
_resolvedTags, and _unresolvedTags. All of these are simultaneously read and
written by the actor's own message loop (HandleTagValueReceived, HandleUnsubscribe,
ReSubscribeAll, HandleRetryTagResolution, ReplyWithHealthReport). This is a
direct violation of the Akka.NET actor model, which guarantees single-threaded access
to actor state only when state is touched on the actor thread. Two concurrent
subscribe requests, or a subscribe overlapping a TagValueReceived / GetHealthReport,
produce data races on Dictionary/HashSet/int — Dictionary is not thread-safe
and concurrent mutation can corrupt internal buckets, throw, or lose entries. It can
also produce torn reads of the health counters.
Recommendation
Do not mutate actor state from the background task. Perform only the await _adapter.SubscribeAsync(...) / ReadAsync(...) I/O in the task, collect the results
into a local immutable result object, and PipeTo(Self) an internal message (e.g.
SubscribeCompleted) whose handler — running on the actor thread — applies all state
mutations and counter updates. The response to Sender should be sent from that
handler too.
Resolution
Resolved 2026-05-16. HandleSubscribe was restructured to follow the actor's own
PipeTo(Self) pattern (the one already used by HandleRetryTagResolution): the
background Task.Run now performs only adapter I/O (SubscribeAsync/ReadAsync),
collects per-tag outcomes into an immutable SubscribeCompleted message, and pipes
that to Self. All mutation of _subscriptionIds, _subscriptionsByInstance,
_totalSubscribed, _resolvedTags and _unresolvedTags now happens in the new
HandleSubscribeCompleted handler on the actor thread; it is wired into the
Connected, Connecting and Reconnecting states so an in-flight subscribe is applied
regardless of state transitions. Regression test
DCL001_ConcurrentSubscribes_DoNotCorruptSubscriptionCounters (30×30 concurrent
subscribes) fails against the pre-fix code and passes after. Fixed by the commit
whose message references DataConnectionLayer-001.
DataConnectionLayer-002 — Restart supervision discards all subscription state on connection-actor crash
| Severity | High |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | src/ScadaLink.DataConnectionLayer/Actors/DataConnectionManagerActor.cs:131-141 |
Description
DataConnectionManagerActor.SupervisorStrategy returns a OneForOneStrategy with
Directive.Restart for DataConnectionActor failures. On restart, Akka.NET creates a
fresh actor instance, so all in-memory fields — _subscriptionsByInstance,
_subscriptionIds, _subscribers, _unresolvedTags, the quality counters — are
silently discarded. The actor re-enters Connecting with zero subscriptions, and the
design doc's "transparent re-subscribe" guarantee (WP-10) is broken: Instance Actors
that had subscribed before the crash never get their tags re-subscribed and will sit
at uncertain/stale quality indefinitely with no error returned. There is no durable
subscription store from which a restarted actor could rebuild state.
Recommendation
Either (a) make the subscription registry durable/recoverable so a restarted actor
can rebuild it (persist to local SQLite as the design doc says connection definitions
are, and have PreStart reload subscriptions), or (b) treat a connection-actor crash
as a lifecycle event the DataConnectionManagerActor notices, so it can re-issue the
subscription registrations. At minimum document that subscribers must re-register
after a crash and surface the lost-state condition rather than failing silently.
Resolution
Resolved 2026-05-16. The DataConnectionManagerActor.SupervisorStrategy was changed
from Directive.Restart to Directive.Resume for DataConnectionActor failures.
Resume keeps the existing actor instance and all its in-memory subscription state
(_subscriptionsByInstance, _subscriptionIds, _subscribers, quality counters)
intact across a transient handler exception, so the design doc's "transparent
re-subscribe" guarantee (WP-10) is preserved. The actor is a long-lived stateful
coordinator and its own Become/Stash reconnect state machine already recovers
connection-level faults — it does not need a restart. This also aligns with the
ScadaLink convention of Resume for coordinator actors. Regression test
DCL002_ConnectionActorCrash_PreservesSubscriptionState crashes the connection actor
via a synchronously-throwing write and asserts the subscription survives (health
report still shows 1 subscribed/resolved tag); it fails against the pre-fix Restart
code and passes after. Fixed by the commit whose message references
DataConnectionLayer-002 (commit <pending>).
DataConnectionLayer-003 — RealOpcUaClient callback/monitored-item dictionaries mutated without synchronization
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:16-17,130-131,153,163,173,183-184 |
Description
_monitoredItems and _callbacks are plain Dictionary<,> instances. They are
written from CreateSubscriptionAsync / RemoveSubscriptionAsync (invoked from the
DataConnectionActor's Task.Run / ContinueWith continuations, i.e. thread-pool
threads) and from DisconnectAsync (.Clear()), while being read concurrently from
the OPC Foundation SDK's MonitoredItem.Notification event handler, which fires on
the SDK's internal publish threads (_callbacks.TryGetValue(handle, ...) at line
163). Concurrent reads during a Dictionary resize or Clear() are undefined
behaviour — they can throw InvalidOperationException, return wrong entries, or
corrupt the dictionary. The DataConnectionActor's subscribe path already runs off
the actor thread (finding 001), so multiple subscribe calls can also race each other
here.
Recommendation
Use ConcurrentDictionary<,> for _monitoredItems and _callbacks, or guard all
access with a lock. Note that fixing finding 001 (serialising subscribe through the
actor thread) reduces but does not eliminate the race, because the SDK notification
threads still read _callbacks concurrently with RemoveSubscriptionAsync /
DisconnectAsync.
Resolution
Resolved 2026-05-16. _monitoredItems and _callbacks in RealOpcUaClient were
changed from plain Dictionary<,> to ConcurrentDictionary<,>, and the two
Remove(key) call sites switched to TryRemove. This makes the maps safe to read
from the OPC Foundation SDK's publish threads (MonitoredItem.Notification reading
_callbacks) concurrently with subscribe/disconnect mutations on other threads.
RealOpcUaClient wraps concrete OPC Foundation SDK types (ISession,
Subscription, MonitoredItem) and cannot be exercised without a live OPC UA
server, so the regression is guarded structurally by
DCL003_SharedDictionaryFields_AreConcurrentCollections (a reflection test asserting
both fields are ConcurrentDictionary<,>); it fails against the pre-fix Dictionary
code and passes after. Fixed by the commit whose message references
DataConnectionLayer-003 (commit <pending>).
DataConnectionLayer-004 — Subscribe-time tag-resolution failure leaves the connection healthy but never recovers correctly
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:495-503,529-537 |
Description
When _adapter.SubscribeAsync throws inside the HandleSubscribe background task,
the catch block adds the tag to _unresolvedTags and increments _totalSubscribed,
treating every subscribe exception as a tag-resolution failure. But SubscribeAsync
also throws InvalidOperationException from EnsureConnected() when the OPC UA
client is not connected, and throws on transport faults — these are connection
problems, not bad tag paths. They get misclassified as unresolved tags and retried on
the 10 s tag-resolution timer instead of triggering the reconnection state machine.
Worse, the design doc (Tag Path Resolution, step 2) says the failed tag's attribute
must be marked quality bad; the code never pushes a bad-quality update to the
subscriber for a tag that fails to resolve at subscribe time, so the Instance Actor
stays at uncertain quality with no signal. The TagResolutionFailed message it sends
to Self only logs and re-arms the timer (HandleTagResolutionFailed).
Recommendation
Distinguish connection-level exceptions (raise AdapterDisconnected / let the
reconnect machine handle them) from genuine node-not-found errors. For genuine
resolution failures, push a TagValueUpdate with QualityCode.Bad to the subscribing
Instance Actor so it reflects the documented behaviour.
Resolution
Resolved 2026-05-16. The subscribe background task now classifies each subscribe
exception via the new IsConnectionLevelFailure helper (InvalidOperationException
— thrown by EnsureConnected() — plus SocketException/TimeoutException/
IOException count as connection-level; anything else is a genuine resolution
failure). The classification is carried on SubscribeTagResult.ConnectionLevelFailure
and applied on the actor thread in HandleSubscribeCompleted: connection-level
failures no longer become unresolved tags and instead drive the reconnection state
machine (HandleSubscribeCompleted returns a flag and the Connected-state handler
calls BecomeReconnecting); genuine resolution failures still go to _unresolvedTags
and the retry timer, and now also push a TagValueUpdate with QualityCode.Bad to
the subscribing Instance Actor, matching the design doc's Tag Path Resolution step 2.
Regression tests DCL004_GenuineTagResolutionFailure_PushesBadQualityToSubscriber
and DCL004_ConnectionLevelSubscribeFailure_TriggersReconnect_NotTagRetry both fail
against the pre-fix code and pass after. Fixed by the commit whose message references
DataConnectionLayer-004 (commit <pending>).
DataConnectionLayer-005 — WriteTimeout option is documented and configured but never applied
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ScadaLink.DataConnectionLayer/DataConnectionOptions.cs:15, src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:573-590 |
Description
DataConnectionOptions.WriteTimeout (default 30 s) and the design doc's "Shared
Settings" table both promise a bounded timeout for synchronous device writes. The
value is never read anywhere in the module (grep confirms only the declaration).
HandleWrite calls _adapter.WriteAsync(request.TagPath, request.Value) with no
CancellationToken and no timeout. If the OPC UA server hangs (TCP black-hole, no
RST), the write Task never completes, PipeTo(sender) never fires, and the calling
script's Ask blocks until its own ask-timeout — and the script gets no DCL-level
error. The design states write failures (including timeout) must be returned
synchronously to the script; an unbounded write violates that.
Recommendation
Create a CancellationTokenSource(_options.WriteTimeout), pass its token to
WriteAsync, and in the continuation translate cancellation into a failed
WriteTagResponse with a timeout error message. Apply the same to the read used by
the initial-value seed and to WriteBatchAndWaitAsync paths if they are reachable.
Resolution
Resolved 2026-05-16. HandleWrite now creates a CancellationTokenSource(_options.WriteTimeout),
passes its token to _adapter.WriteAsync(...), and disposes the source in the
continuation. A cancelled/timed-out write (Task.IsCanceled or a base
OperationCanceledException) is translated into a failed WriteTagResponse with a
"Write timeout after Ns" message, so a hung device write is bounded and the failure
is returned synchronously to the calling script (WP-11) instead of blocking until the
script's own Ask-timeout. (The WriteBatchAndWaitAsync adapter path already accepts
an explicit timeout/CancellationToken and is not invoked by HandleWrite, so no
change was needed there.) Regression test
DCL005_Write_ThatHangs_TimesOutAndReturnsFailureSynchronously uses an adapter whose
WriteAsync only completes when its token fires; it fails against the pre-fix
unbounded code and passes after. Fixed by the commit whose message references
DataConnectionLayer-005 (commit <pending>).
DataConnectionLayer-006 — Health quality counters not reset/recomputed after failover or re-subscribe
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:645-673,721-756 |
Description
ReSubscribeAll resets _subscriptionIds, _unresolvedTags and _resolvedTags to a
clean slate, but leaves _lastTagQuality, _tagsGoodQuality, _tagsBadQuality and
_tagsUncertainQuality untouched. PushBadQualityForAllTags (called on disconnect)
sets _tagsBadQuality = _lastTagQuality.Count and zeroes the others. After a
reconnect, HandleTagValueReceived decrements the old bucket using
_lastTagQuality's value and increments the new one — but tags resolved for the first
time after reconnect were never in _lastTagQuality, so they only increment, never
decrement, and the totals can drift above _totalSubscribed. Over repeated
disconnect/reconnect cycles the health report's good/bad/uncertain counts become
unreliable.
Recommendation
On BecomeConnected after a re-subscribe (or in ReSubscribeAll), clear
_lastTagQuality and the three quality counters and let them be repopulated from
fresh TagValueReceived messages. Alternatively recompute the buckets from
_lastTagQuality whenever it changes rather than maintaining incremental counters.
Resolution
Unresolved.
DataConnectionLayer-007 — ReadBatchAsync aborts the whole batch on the first failing tag
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:187-195 |
Description
ReadBatchAsync loops calling ReadAsync per tag. ReadAsync re-throws any
non-cancellation exception (line 184). So if any single tag in the batch throws (bad
node, transient fault), the entire ReadBatchAsync throws and the caller gets no
results for the tags that did read successfully — even though ReadResult already
has a Success/ErrorMessage shape designed to carry per-tag failures. The batch is
also fully serial (one round-trip per tag), defeating the point of a batch API; the
design doc lists ReadBatch/WriteBatch as first-class operations.
Recommendation
Catch per-tag exceptions inside the loop and store a failed ReadResult for that tag
so the batch returns a complete map. Ideally issue a single OPC UA Read service call
for all node IDs (RealOpcUaClient.ReadValueAsync already builds a
ReadValueIdCollection — extend it to accept multiple nodes).
Resolution
Unresolved.
DataConnectionLayer-008 — HandleUnsubscribe is O(n^2) over instances and rechecks _unresolvedTags redundantly
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:540-569 |
Description
For each tag of the instance being removed, HandleUnsubscribe scans every other
instance's tag set (_subscriptionsByInstance.Where(...).Any()), making the operation
O(tags x instances). On a site with many instances sharing a connection this is
needlessly expensive on every instance stop/redeploy. Separately, line 562
re-evaluates !_unresolvedTags.Contains(tagPath) immediately after line 561 already
removed tagPath from _unresolvedTags, so the condition is always true — dead
logic that obscures intent (the decrement of _resolvedTags is unconditional in
practice).
Recommendation
Maintain a reference count per tag path (or a tagPath -> set<instance> reverse index)
so the "any other subscriber" check is O(1). Remove the redundant _unresolvedTags
re-check or restructure so the resolved/unresolved decrement reflects the tag's actual
prior state captured before removal.
Resolution
Unresolved.
DataConnectionLayer-009 — Implemented failover heuristic diverges from the documented state machine
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:189,242-297,379-449, docs/requirements/Component-DataConnectionLayer.md:73-85 |
Description
The design doc's failover state machine reads "retry active endpoint (5s) -> N failures
(>= FailoverRetryCount) -> switch to other endpoint". The code implements two separate
failover triggers: (a) HandleReconnectResult counts _consecutiveFailures on
connect-attempt failures (matches the doc), and (b) BecomeReconnecting additionally
counts _consecutiveUnstableDisconnects — connections that succeeded but dropped
within a hard-coded 60 s StableConnectionThreshold — and fails over on that count
too. The unstable-disconnect path, the 60 s threshold, and the fact that failover can
happen on successful-but-flaky connections are not described in the component doc at
all. A reviewer or operator reading Component-DataConnectionLayer.md would not
predict this behaviour, and the 60 s threshold is a magic constant not exposed via
DataConnectionOptions.
Recommendation
Update Component-DataConnectionLayer.md to document the unstable-disconnect failover
path and the stability threshold, and move the 60 s threshold into
DataConnectionOptions so it is configurable and consistent with the other tunables.
Resolution
Unresolved.
DataConnectionLayer-010 — Tag-resolution retry can issue duplicate concurrent subscribe attempts
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:594-619,689-703 |
Description
HandleRetryTagResolution fires SubscribeAsync for every tag in _unresolvedTags
via ContinueWith(...).PipeTo(self), but does not remove the tags from
_unresolvedTags while the attempts are in flight. Because tags are not removed
before the retry, a slow SubscribeAsync overlapping the next 10 s tick issues
duplicate concurrent subscribe attempts for the same tag, which can create duplicate
monitored items / leaked subscription IDs (the second success overwrites
_subscriptionIds[tag] in HandleTagResolutionSucceeded, orphaning the first handle
with no UnsubscribeAsync call). The timer-cancel condition in
HandleTagResolutionSucceeded is also non-deterministic for the same reason.
Recommendation
Remove tags from _unresolvedTags (into an "in-flight" set) when a retry is
dispatched, and only put them back on failure. This prevents overlapping duplicate
subscribe attempts and makes the timer-cancel condition deterministic.
Resolution
Unresolved.
DataConnectionLayer-011 — Stale subscription callbacks from disposed adapters can still reach the actor
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:486-489,278-285,416-425, src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:252-262 |
Description
On failover the actor disposes the old adapter (_adapter.DisposeAsync(),
fire-and-forget) and creates a fresh one. The old adapter's subscription callbacks
captured self and tagPath and Tell TagValueReceived to the actor. While the
Reconnecting handler ignores TagValueReceived (line 334), once the actor reaches
Connected again it processes them — and a disposed adapter whose OPC UA SDK threads
have not yet fully torn down could still deliver a value, mixing pre-failover device
data with the new endpoint's data and briefly reporting a value the active endpoint
never produced. There is no per-adapter generation/epoch tag on TagValueReceived to
distinguish current from stale callbacks.
Recommendation
Add an adapter-generation counter incremented on every adapter swap; stamp it onto
TagValueReceived (captured in the callback closure) and drop messages whose
generation does not match the current adapter in HandleTagValueReceived.
Resolution
Unresolved.
DataConnectionLayer-012 — AutoAcceptUntrustedCerts defaults to true, accepting any server certificate
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | src/ScadaLink.DataConnectionLayer/Adapters/IOpcUaClient.cs:17, src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:49,60-61, docs/requirements/Component-DataConnectionLayer.md:116 |
Description
OpcUaConnectionOptions.AutoAcceptUntrustedCerts defaults to true, and
RealOpcUaClient.ConnectAsync wires CertificateValidator.CertificateValidation += (_, e) => e.Accept = true
when it is set. With the default, every server certificate is accepted unconditionally
— there is no certificate-pinning or trust-store enforcement — which defeats the
Sign/SignAndEncrypt security modes against an active man-in-the-middle on the OPC
UA link. The design doc explicitly lists true as the default. For an industrial
control link this is a meaningful exposure; a secure-by-default posture would reject
untrusted certs unless an operator opts in per connection.
Recommendation
Default AutoAcceptUntrustedCerts to false and require explicit per-connection
opt-in, or at minimum log a prominent warning whenever the auto-accept validator is
installed. Update the design doc to reflect the secure default.
Resolution
Unresolved.
DataConnectionLayer-013 — Misleading XML comment: RaiseDisconnected claims thread safety it does not provide
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:270-281 |
Description
The XML doc on RaiseDisconnected states "Thread-safe: only the first caller triggers
the event." The implementation is a non-atomic check-then-set on a volatile bool
(if (_disconnectFired) return; _disconnectFired = true;). volatile guarantees
visibility, not atomicity — two threads (e.g. the OPC UA keep-alive thread via
OnClientConnectionLost and a ReadAsync failure path) can both observe
_disconnectFired == false and both invoke Disconnected. In practice the
DataConnectionActor tolerates a duplicate AdapterDisconnected message, so impact
is low, but the comment overstates the guarantee. The same pattern exists in
RealOpcUaClient.OnSessionKeepAlive (_connectionLostFired).
Recommendation
Either make the guard atomic (Interlocked.Exchange with an int flag, or a lock),
or correct the comment to say "best-effort once-only; a duplicate event is possible
under a race and is tolerated downstream."
Resolution
Unresolved.