Files
scadalink-design/code-reviews/DataConnectionLayer/findings.md

50 KiB
Raw Blame History

Code Review — DataConnectionLayer

Field Value
Module src/ScadaLink.DataConnectionLayer
Design doc docs/requirements/Component-DataConnectionLayer.md
Status Reviewed
Last reviewed 2026-05-17
Reviewer claude-agent
Commit reviewed 39d737e
Open findings 0

Summary

The DataConnectionLayer is a reasonably well-structured module: the Become/Stash lifecycle state machine, the captured-Self marshalling of background-thread disconnect events, and the protocol-factory abstraction all follow the design doc and Akka.NET conventions. However, the review found one critical actor-model violation — HandleSubscribe spawns a Task.Run that mutates the actor's private dictionaries and counters from a thread-pool thread, racing with the actor's own message loop. Several high-severity issues cluster around concurrency and error handling: the subscription-failure path leaves the connection with degraded subtrees but no real recovery, the DataConnectionManagerActor's Restart supervision drops all subscription state on a connection-actor crash, and RealOpcUaClient's monitored- item callback dictionary is mutated without synchronization while OPC UA notification threads read it. The remaining findings concern stale health counters after failover, an unused WriteTimeout option (writes are unbounded despite the design promising a 30 s timeout), ReadBatchAsync aborting mid-batch, and documentation drift between the design doc's failover state machine and the implemented unstable-disconnect heuristic. Test coverage is adequate for the happy paths and failover but absent for tag-resolution retry, disconnect/re-subscribe, and concurrency around HandleSubscribe.

Re-review 2026-05-17 (commit 39d737e)

All 13 findings from the 2026-05-16 review remain Resolved and the fixes were verified in place against the current source (PipeTo(Self) subscribe pattern, Resume supervision, ConcurrentDictionary callback maps, atomic disconnect guards, bounded write timeout, etc.). The re-review walked all 10 checklist categories again and found 4 new findings: one High — the DCL-012 security warning is never seen in production because RealOpcUaClientFactory.Create() constructs RealOpcUaClient with no logger, so the warning sinks into NullLogger; one Medium — initial-connect failures in the Connecting state never count toward failover, so a connection whose primary endpoint is unreachable at startup retries the primary forever and never tries the configured backup; one MediumHandleSubscribeCompleted always replies SubscribeTagsResponse(success: true) even when a connection-level subscribe failure is driving the actor into Reconnecting, telling the Instance Actor the subscribe succeeded when it did not; and one LowWriteBatchAsync does not catch the InvalidOperationException from EnsureConnected, so a mid-batch disconnect aborts the whole write batch (the same class of defect DCL-007 fixed for ReadBatchAsync). New findings are numbered from DataConnectionLayer-014.

Checklist coverage

# Category Examined Notes
1 Correctness & logic bugs x 2026-05-16 findings resolved. Re-review: finding 016 — SubscribeTagsResponse reports success on a connection-level subscribe failure.
2 Akka.NET conventions x 2026-05-16 findings resolved (PipeTo(Self) subscribe, Resume supervision). Re-review: no new issues.
3 Concurrency & thread safety x 2026-05-16 findings resolved (ConcurrentDictionary, atomic disconnect guards). Re-review: no new issues.
4 Error handling & resilience x Re-review: finding 015 — initial-connect failures never trigger failover; finding 017 — WriteBatchAsync aborts on mid-batch disconnect.
5 Security x Re-review: finding 014 — the DCL-012 auto-accept-cert warning is never logged in production (RealOpcUaClient built without a logger).
6 Performance & resource management x 2026-05-16 finding 008 resolved (reverse index). Re-review: no new issues.
7 Design-document adherence x 2026-05-16 findings 005/009 resolved. Re-review: no new issues (finding 015 logged under resilience).
8 Code organization & conventions x No issues found — POCOs in Commons, options class owned by component, factory pattern consistent.
9 Testing coverage x DCL001013 regression tests present. Re-review: gaps remain for finding 014/015/016 scenarios (no test for production logger wiring, startup failover, or subscribe-response-on-failure).
10 Documentation & comments x 2026-05-16 finding 013 resolved. Re-review: no new issues.

Findings

DataConnectionLayer-001 — Task.Run in HandleSubscribe mutates actor state off the actor thread

Severity Critical
Category Concurrency & thread safety
Status Resolved
Location src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:473-538

Description

HandleSubscribe launches a Task.Run(async () => ...) that runs on a thread-pool thread and directly mutates the actor's private mutable state: instanceTags (a reference into _subscriptionsByInstance), _subscriptionIds, _totalSubscribed, _resolvedTags, and _unresolvedTags. All of these are simultaneously read and written by the actor's own message loop (HandleTagValueReceived, HandleUnsubscribe, ReSubscribeAll, HandleRetryTagResolution, ReplyWithHealthReport). This is a direct violation of the Akka.NET actor model, which guarantees single-threaded access to actor state only when state is touched on the actor thread. Two concurrent subscribe requests, or a subscribe overlapping a TagValueReceived / GetHealthReport, produce data races on Dictionary/HashSet/intDictionary is not thread-safe and concurrent mutation can corrupt internal buckets, throw, or lose entries. It can also produce torn reads of the health counters.

Recommendation

Do not mutate actor state from the background task. Perform only the await _adapter.SubscribeAsync(...) / ReadAsync(...) I/O in the task, collect the results into a local immutable result object, and PipeTo(Self) an internal message (e.g. SubscribeCompleted) whose handler — running on the actor thread — applies all state mutations and counter updates. The response to Sender should be sent from that handler too.

Resolution

Resolved 2026-05-16. HandleSubscribe was restructured to follow the actor's own PipeTo(Self) pattern (the one already used by HandleRetryTagResolution): the background Task.Run now performs only adapter I/O (SubscribeAsync/ReadAsync), collects per-tag outcomes into an immutable SubscribeCompleted message, and pipes that to Self. All mutation of _subscriptionIds, _subscriptionsByInstance, _totalSubscribed, _resolvedTags and _unresolvedTags now happens in the new HandleSubscribeCompleted handler on the actor thread; it is wired into the Connected, Connecting and Reconnecting states so an in-flight subscribe is applied regardless of state transitions. Regression test DCL001_ConcurrentSubscribes_DoNotCorruptSubscriptionCounters (30×30 concurrent subscribes) fails against the pre-fix code and passes after. Fixed by the commit whose message references DataConnectionLayer-001.

DataConnectionLayer-002 — Restart supervision discards all subscription state on connection-actor crash

Severity High
Category Akka.NET conventions
Status Resolved
Location src/ScadaLink.DataConnectionLayer/Actors/DataConnectionManagerActor.cs:131-141

Description

DataConnectionManagerActor.SupervisorStrategy returns a OneForOneStrategy with Directive.Restart for DataConnectionActor failures. On restart, Akka.NET creates a fresh actor instance, so all in-memory fields — _subscriptionsByInstance, _subscriptionIds, _subscribers, _unresolvedTags, the quality counters — are silently discarded. The actor re-enters Connecting with zero subscriptions, and the design doc's "transparent re-subscribe" guarantee (WP-10) is broken: Instance Actors that had subscribed before the crash never get their tags re-subscribed and will sit at uncertain/stale quality indefinitely with no error returned. There is no durable subscription store from which a restarted actor could rebuild state.

Recommendation

Either (a) make the subscription registry durable/recoverable so a restarted actor can rebuild it (persist to local SQLite as the design doc says connection definitions are, and have PreStart reload subscriptions), or (b) treat a connection-actor crash as a lifecycle event the DataConnectionManagerActor notices, so it can re-issue the subscription registrations. At minimum document that subscribers must re-register after a crash and surface the lost-state condition rather than failing silently.

Resolution

Resolved 2026-05-16. The DataConnectionManagerActor.SupervisorStrategy was changed from Directive.Restart to Directive.Resume for DataConnectionActor failures. Resume keeps the existing actor instance and all its in-memory subscription state (_subscriptionsByInstance, _subscriptionIds, _subscribers, quality counters) intact across a transient handler exception, so the design doc's "transparent re-subscribe" guarantee (WP-10) is preserved. The actor is a long-lived stateful coordinator and its own Become/Stash reconnect state machine already recovers connection-level faults — it does not need a restart. This also aligns with the ScadaLink convention of Resume for coordinator actors. Regression test DCL002_ConnectionActorCrash_PreservesSubscriptionState crashes the connection actor via a synchronously-throwing write and asserts the subscription survives (health report still shows 1 subscribed/resolved tag); it fails against the pre-fix Restart code and passes after. Fixed by the commit whose message references DataConnectionLayer-002 (commit <pending>).

DataConnectionLayer-003 — RealOpcUaClient callback/monitored-item dictionaries mutated without synchronization

Severity High
Category Concurrency & thread safety
Status Resolved
Location src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:16-17,130-131,153,163,173,183-184

Description

_monitoredItems and _callbacks are plain Dictionary<,> instances. They are written from CreateSubscriptionAsync / RemoveSubscriptionAsync (invoked from the DataConnectionActor's Task.Run / ContinueWith continuations, i.e. thread-pool threads) and from DisconnectAsync (.Clear()), while being read concurrently from the OPC Foundation SDK's MonitoredItem.Notification event handler, which fires on the SDK's internal publish threads (_callbacks.TryGetValue(handle, ...) at line 163). Concurrent reads during a Dictionary resize or Clear() are undefined behaviour — they can throw InvalidOperationException, return wrong entries, or corrupt the dictionary. The DataConnectionActor's subscribe path already runs off the actor thread (finding 001), so multiple subscribe calls can also race each other here.

Recommendation

Use ConcurrentDictionary<,> for _monitoredItems and _callbacks, or guard all access with a lock. Note that fixing finding 001 (serialising subscribe through the actor thread) reduces but does not eliminate the race, because the SDK notification threads still read _callbacks concurrently with RemoveSubscriptionAsync / DisconnectAsync.

Resolution

Resolved 2026-05-16. _monitoredItems and _callbacks in RealOpcUaClient were changed from plain Dictionary<,> to ConcurrentDictionary<,>, and the two Remove(key) call sites switched to TryRemove. This makes the maps safe to read from the OPC Foundation SDK's publish threads (MonitoredItem.Notification reading _callbacks) concurrently with subscribe/disconnect mutations on other threads. RealOpcUaClient wraps concrete OPC Foundation SDK types (ISession, Subscription, MonitoredItem) and cannot be exercised without a live OPC UA server, so the regression is guarded structurally by DCL003_SharedDictionaryFields_AreConcurrentCollections (a reflection test asserting both fields are ConcurrentDictionary<,>); it fails against the pre-fix Dictionary code and passes after. Fixed by the commit whose message references DataConnectionLayer-003 (commit <pending>).

DataConnectionLayer-004 — Subscribe-time tag-resolution failure leaves the connection healthy but never recovers correctly

Severity High
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:495-503,529-537

Description

When _adapter.SubscribeAsync throws inside the HandleSubscribe background task, the catch block adds the tag to _unresolvedTags and increments _totalSubscribed, treating every subscribe exception as a tag-resolution failure. But SubscribeAsync also throws InvalidOperationException from EnsureConnected() when the OPC UA client is not connected, and throws on transport faults — these are connection problems, not bad tag paths. They get misclassified as unresolved tags and retried on the 10 s tag-resolution timer instead of triggering the reconnection state machine. Worse, the design doc (Tag Path Resolution, step 2) says the failed tag's attribute must be marked quality bad; the code never pushes a bad-quality update to the subscriber for a tag that fails to resolve at subscribe time, so the Instance Actor stays at uncertain quality with no signal. The TagResolutionFailed message it sends to Self only logs and re-arms the timer (HandleTagResolutionFailed).

Recommendation

Distinguish connection-level exceptions (raise AdapterDisconnected / let the reconnect machine handle them) from genuine node-not-found errors. For genuine resolution failures, push a TagValueUpdate with QualityCode.Bad to the subscribing Instance Actor so it reflects the documented behaviour.

Resolution

Resolved 2026-05-16. The subscribe background task now classifies each subscribe exception via the new IsConnectionLevelFailure helper (InvalidOperationException — thrown by EnsureConnected() — plus SocketException/TimeoutException/ IOException count as connection-level; anything else is a genuine resolution failure). The classification is carried on SubscribeTagResult.ConnectionLevelFailure and applied on the actor thread in HandleSubscribeCompleted: connection-level failures no longer become unresolved tags and instead drive the reconnection state machine (HandleSubscribeCompleted returns a flag and the Connected-state handler calls BecomeReconnecting); genuine resolution failures still go to _unresolvedTags and the retry timer, and now also push a TagValueUpdate with QualityCode.Bad to the subscribing Instance Actor, matching the design doc's Tag Path Resolution step 2. Regression tests DCL004_GenuineTagResolutionFailure_PushesBadQualityToSubscriber and DCL004_ConnectionLevelSubscribeFailure_TriggersReconnect_NotTagRetry both fail against the pre-fix code and pass after. Fixed by the commit whose message references DataConnectionLayer-004 (commit <pending>).

DataConnectionLayer-005 — WriteTimeout option is documented and configured but never applied

Severity High
Category Design-document adherence
Status Resolved
Location src/ScadaLink.DataConnectionLayer/DataConnectionOptions.cs:15, src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:573-590

Description

DataConnectionOptions.WriteTimeout (default 30 s) and the design doc's "Shared Settings" table both promise a bounded timeout for synchronous device writes. The value is never read anywhere in the module (grep confirms only the declaration). HandleWrite calls _adapter.WriteAsync(request.TagPath, request.Value) with no CancellationToken and no timeout. If the OPC UA server hangs (TCP black-hole, no RST), the write Task never completes, PipeTo(sender) never fires, and the calling script's Ask blocks until its own ask-timeout — and the script gets no DCL-level error. The design states write failures (including timeout) must be returned synchronously to the script; an unbounded write violates that.

Recommendation

Create a CancellationTokenSource(_options.WriteTimeout), pass its token to WriteAsync, and in the continuation translate cancellation into a failed WriteTagResponse with a timeout error message. Apply the same to the read used by the initial-value seed and to WriteBatchAndWaitAsync paths if they are reachable.

Resolution

Resolved 2026-05-16. HandleWrite now creates a CancellationTokenSource(_options.WriteTimeout), passes its token to _adapter.WriteAsync(...), and disposes the source in the continuation. A cancelled/timed-out write (Task.IsCanceled or a base OperationCanceledException) is translated into a failed WriteTagResponse with a "Write timeout after Ns" message, so a hung device write is bounded and the failure is returned synchronously to the calling script (WP-11) instead of blocking until the script's own Ask-timeout. (The WriteBatchAndWaitAsync adapter path already accepts an explicit timeout/CancellationToken and is not invoked by HandleWrite, so no change was needed there.) Regression test DCL005_Write_ThatHangs_TimesOutAndReturnsFailureSynchronously uses an adapter whose WriteAsync only completes when its token fires; it fails against the pre-fix unbounded code and passes after. Fixed by the commit whose message references DataConnectionLayer-005 (commit <pending>).

DataConnectionLayer-006 — Health quality counters not reset/recomputed after failover or re-subscribe

Severity Medium
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:645-673,721-756

Description

ReSubscribeAll resets _subscriptionIds, _unresolvedTags and _resolvedTags to a clean slate, but leaves _lastTagQuality, _tagsGoodQuality, _tagsBadQuality and _tagsUncertainQuality untouched. PushBadQualityForAllTags (called on disconnect) sets _tagsBadQuality = _lastTagQuality.Count and zeroes the others. After a reconnect, HandleTagValueReceived decrements the old bucket using _lastTagQuality's value and increments the new one — but tags resolved for the first time after reconnect were never in _lastTagQuality, so they only increment, never decrement, and the totals can drift above _totalSubscribed. Over repeated disconnect/reconnect cycles the health report's good/bad/uncertain counts become unreliable.

Verification note: Confirmed against source. The root cause is broader than the reconnect path the finding describes: HandleUnsubscribe also never removes a tag from _lastTagQuality nor decrements its quality bucket, so an unsubscribed tag lingers and PushBadQualityForAllTags (which sets _tagsBadQuality = _lastTagQuality.Count) over-counts it — driving the bad-quality count above _totalSubscribed even without a re-subscribe. Both the unsubscribe leak and the re-subscribe drift are real.

Recommendation

On BecomeConnected after a re-subscribe (or in ReSubscribeAll), clear _lastTagQuality and the three quality counters and let them be repopulated from fresh TagValueReceived messages. Alternatively recompute the buckets from _lastTagQuality whenever it changes rather than maintaining incremental counters.

Resolution

Resolved 2026-05-16 (commit pending). HandleUnsubscribe now removes each unsubscribed tag from _lastTagQuality and decrements the corresponding quality bucket, then reports the corrected counters via UpdateTagQuality/UpdateTagResolution; ReSubscribeAll clears _lastTagQuality and zeroes the three quality counters so post-reconnect tags are repopulated from fresh TagValueReceived messages instead of only incrementing. Regression test DCL006_DisconnectAfterUnsubscribe_BadQualityCountMatchesRemainingTags subscribes two tags, pushes Good values, unsubscribes one, then disconnects and asserts PushBadQualityForAllTags reports exactly 1 bad tag (the reconnect is gated open so ReSubscribeAll does not run before the assertion); it reports 2 against the pre-fix code and 1 after.

DataConnectionLayer-007 — ReadBatchAsync aborts the whole batch on the first failing tag

Severity Medium
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:187-195

Description

ReadBatchAsync loops calling ReadAsync per tag. ReadAsync re-throws any non-cancellation exception (line 184). So if any single tag in the batch throws (bad node, transient fault), the entire ReadBatchAsync throws and the caller gets no results for the tags that did read successfully — even though ReadResult already has a Success/ErrorMessage shape designed to carry per-tag failures. The batch is also fully serial (one round-trip per tag), defeating the point of a batch API; the design doc lists ReadBatch/WriteBatch as first-class operations.

Verification note: Confirmed against source — ReadAsync re-throws on any non-OperationCanceledException, aborting the whole batch.

Recommendation

Catch per-tag exceptions inside the loop and store a failed ReadResult for that tag so the batch returns a complete map. Ideally issue a single OPC UA Read service call for all node IDs (RealOpcUaClient.ReadValueAsync already builds a ReadValueIdCollection — extend it to accept multiple nodes).

Resolution

Resolved 2026-05-16 (commit pending). ReadBatchAsync now wraps each per-tag ReadAsync in a try/catch: a per-tag exception is recorded as a failed ReadResult (Success: false, message = the exception message) so the batch returns a complete result map for every requested tag; OperationCanceledException is still propagated so a cancelled batch aborts as a whole. The per-tag-serial loop and single-service-call optimisation were deliberately left for a follow-up — they are a performance concern, not the correctness bug this finding raised. Regression test DCL007_ReadBatch_ReturnsPerTagResults_WhenOneTagFails reads three tags where the middle one throws and asserts all three appear in the result map with the failing one marked unsuccessful; it threw (no map returned) against the pre-fix code and passes after.

DataConnectionLayer-008 — HandleUnsubscribe is O(n^2) over instances and rechecks _unresolvedTags redundantly

Severity Low
Category Performance & resource management
Status Resolved
Location src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:540-569

Description

For each tag of the instance being removed, HandleUnsubscribe scans every other instance's tag set (_subscriptionsByInstance.Where(...).Any()), making the operation O(tags x instances). On a site with many instances sharing a connection this is needlessly expensive on every instance stop/redeploy. Separately, line 562 re-evaluates !_unresolvedTags.Contains(tagPath) immediately after line 561 already removed tagPath from _unresolvedTags, so the condition is always true — dead logic that obscures intent (the decrement of _resolvedTags is unconditional in practice).

Recommendation

Maintain a reference count per tag path (or a tagPath -> set<instance> reverse index) so the "any other subscriber" check is O(1). Remove the redundant _unresolvedTags re-check or restructure so the resolved/unresolved decrement reflects the tag's actual prior state captured before removal.

Resolution

Resolved 2026-05-16 (commit pending). A tagPath → subscriber-count reverse index (_tagSubscriberCount) was added: HandleSubscribeCompleted increments it whenever a tag is newly added to an instance's set, and HandleUnsubscribe decrements it, releasing a tag at the adapter only when the count reaches zero. The "any other subscriber" check is now O(1) per tag instead of an O(instances) Where(...).Any() scan. The redundant !_unresolvedTags.Contains(tagPath) re-check (always true after the unconditional Remove on the line above) was removed — the surviving branch is entered only for tags that have a subscription id, which are by definition resolved, so _resolvedTags-- is now unconditional with an explanatory comment. The cleanup also fixed a latent leak the original code could not reach: an unresolved tag whose last subscriber unsubscribes is now removed from _unresolvedTags/_resolutionInFlight and decremented from _totalSubscribed (previously it lingered in the retry timer and the subscribed total forever). Regression test DCL008_Unsubscribe_OnlyReleasesTagWhenLastSubscriberLeaves subscribes a tag from two instances plus an exclusive tag, then unsubscribes each instance and asserts the shared tag is released at the adapter only after the last subscriber leaves and the health counters track correctly. (This finding is a performance refactor, not a correctness bug — the pre-fix Where(...).Any() logic was functionally correct, so the test passes against both versions and serves as a behavioural guard for the refactor.)

DataConnectionLayer-009 — Implemented failover heuristic diverges from the documented state machine

Severity Medium — partially design-doc work outside this module's editable scope
Category Design-document adherence
Status Resolved
Location src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:189,242-297,379-449, docs/requirements/Component-DataConnectionLayer.md:73-85

Description

The design doc's failover state machine reads "retry active endpoint (5s) -> N failures (>= FailoverRetryCount) -> switch to other endpoint". The code implements two separate failover triggers: (a) HandleReconnectResult counts _consecutiveFailures on connect-attempt failures (matches the doc), and (b) BecomeReconnecting additionally counts _consecutiveUnstableDisconnects — connections that succeeded but dropped within a hard-coded 60 s StableConnectionThreshold — and fails over on that count too. The unstable-disconnect path, the 60 s threshold, and the fact that failover can happen on successful-but-flaky connections are not described in the component doc at all. A reviewer or operator reading Component-DataConnectionLayer.md would not predict this behaviour, and the 60 s threshold is a magic constant not exposed via DataConnectionOptions.

Verification note: Confirmed against source. The hard-coded StableConnectionThreshold = TimeSpan.FromSeconds(60) static readonly field and the _consecutiveUnstableDisconnects failover path both exist as described.

Recommendation

Update Component-DataConnectionLayer.md to document the unstable-disconnect failover path and the stability threshold, and move the 60 s threshold into DataConnectionOptions so it is configurable and consistent with the other tunables.

Resolution

Resolved 2026-05-16 (commit pending). The configurability half of the recommendation is done: the hard-coded StableConnectionThreshold constant was removed from DataConnectionActor and replaced with a new DataConnectionOptions.StableConnectionThreshold property (60 s default), bindable from the DataConnectionLayer appsettings.json section like ReconnectInterval/TagResolutionRetryInterval/WriteTimeout. Regression test DCL009_StableConnectionThreshold_IsConfigurable_WithSixtySecondDefault guards the default and the setter. The documentation half is out of this module's editable scopedocs/requirements/Component-DataConnectionLayer.md (lines 73-85) still describes only the connect-failure failover path and does not mention the unstable-disconnect trigger. Action required (surfaced): the DCL design doc should be updated to document the unstable-disconnect failover path and the configurable stability threshold; that edit was deliberately not made here because this task is scoped to src/ScadaLink.DataConnectionLayer, tests, and this findings file only.

DataConnectionLayer-010 — Tag-resolution retry can issue duplicate concurrent subscribe attempts

Severity Medium
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:594-619,689-703

Description

HandleRetryTagResolution fires SubscribeAsync for every tag in _unresolvedTags via ContinueWith(...).PipeTo(self), but does not remove the tags from _unresolvedTags while the attempts are in flight. Because tags are not removed before the retry, a slow SubscribeAsync overlapping the next 10 s tick issues duplicate concurrent subscribe attempts for the same tag, which can create duplicate monitored items / leaked subscription IDs (the second success overwrites _subscriptionIds[tag] in HandleTagResolutionSucceeded, orphaning the first handle with no UnsubscribeAsync call). The timer-cancel condition in HandleTagResolutionSucceeded is also non-deterministic for the same reason.

Verification note: Confirmed against source — HandleRetryTagResolution dispatched SubscribeAsync for every tag in _unresolvedTags on every tick with no in-flight guard.

Recommendation

Remove tags from _unresolvedTags (into an "in-flight" set) when a retry is dispatched, and only put them back on failure. This prevents overlapping duplicate subscribe attempts and makes the timer-cancel condition deterministic.

Resolution

Resolved 2026-05-16 (commit pending). A new _resolutionInFlight HashSet<string> tracks tags whose retry SubscribeAsync is currently outstanding. HandleRetryTagResolution now dispatches only for unresolved tags not already in flight (and skips entirely if all are in flight), adding each dispatched tag to the set; HandleTagResolutionSucceeded and HandleTagResolutionFailed remove the tag from the set when its attempt completes, and HandleUnsubscribe/ReSubscribeAll clear stale entries. This prevents overlapping duplicate subscribe attempts and the resulting orphaned monitored items. Regression test DCL010_TagResolutionRetry_DoesNotIssueDuplicateConcurrentSubscribes gives a tag a genuine initial failure then a retry SubscribeAsync that never completes, lets six 100 ms retry ticks elapse, and asserts exactly one retry was dispatched (2 total subscribe calls); the pre-fix code dispatched on every tick (6 total).

DataConnectionLayer-011 — Stale subscription callbacks from disposed adapters can still reach the actor

Severity Medium
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:486-489,278-285,416-425, src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:252-262

Description

On failover the actor disposes the old adapter (_adapter.DisposeAsync(), fire-and-forget) and creates a fresh one. The old adapter's subscription callbacks captured self and tagPath and Tell TagValueReceived to the actor. While the Reconnecting handler ignores TagValueReceived (line 334), once the actor reaches Connected again it processes them — and a disposed adapter whose OPC UA SDK threads have not yet fully torn down could still deliver a value, mixing pre-failover device data with the new endpoint's data and briefly reporting a value the active endpoint never produced. There is no per-adapter generation/epoch tag on TagValueReceived to distinguish current from stale callbacks.

Verification note: Confirmed against source — TagValueReceived carried no adapter identity, and HandleTagValueReceived (reachable in Connected) processed any such message regardless of which adapter produced it.

Recommendation

Add an adapter-generation counter incremented on every adapter swap; stamp it onto TagValueReceived (captured in the callback closure) and drop messages whose generation does not match the current adapter in HandleTagValueReceived.

Resolution

Resolved 2026-05-16 (commit pending). Implemented exactly as recommended: a new _adapterGeneration int field is incremented at both adapter-swap sites (the unstable-disconnect failover in BecomeReconnecting and the connect-failure failover in HandleReconnectResult). The TagValueReceived record gained an AdapterGeneration field; every subscription callback closure (HandleSubscribe, the initial-read seed, HandleRetryTagResolution, ReSubscribeAll) captures the generation in effect at subscribe time and stamps it onto each TagValueReceived. HandleTagValueReceived drops any message whose generation no longer matches the current adapter, so a callback fired by a disposed adapter after failover cannot reach an Instance Actor. Regression test DCL011_StaleTagValueFromOldAdapter_IsNotForwardedAfterFailover subscribes on the primary, fails over to the backup, then invokes the captured primary callback with a stale value and asserts the subscriber receives nothing; the stale value reached the subscriber against the pre-fix code and is dropped after.

DataConnectionLayer-012 — AutoAcceptUntrustedCerts defaults to true, accepting any server certificate

Severity Medium — full secure default also requires a Commons + design-doc change outside this module
Category Security
Status Resolved
Location src/ScadaLink.DataConnectionLayer/Adapters/IOpcUaClient.cs:17, src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:49,60-61, docs/requirements/Component-DataConnectionLayer.md:116

Description

OpcUaConnectionOptions.AutoAcceptUntrustedCerts defaults to true, and RealOpcUaClient.ConnectAsync wires CertificateValidator.CertificateValidation += (_, e) => e.Accept = true when it is set. With the default, every server certificate is accepted unconditionally — there is no certificate-pinning or trust-store enforcement — which defeats the Sign/SignAndEncrypt security modes against an active man-in-the-middle on the OPC UA link. The design doc explicitly lists true as the default. For an industrial control link this is a meaningful exposure; a secure-by-default posture would reject untrusted certs unless an operator opts in per connection.

Verification note: Confirmed against source. Note the authoritative runtime default does not actually live on OpcUaConnectionOptions — for a real connection OpcUaDataConnection.ConnectAsync builds OpcUaConnectionOptions from OpcUaEndpointConfig (in ScadaLink.Commons), whose AutoAcceptUntrustedCerts property also defaults to true. OpcUaConnectionOptions' own default is only the fallback used when an OpcUaConnectionOptions is constructed directly.

Recommendation

Default AutoAcceptUntrustedCerts to false and require explicit per-connection opt-in, or at minimum log a prominent warning whenever the auto-accept validator is installed. Update the design doc to reflect the secure default.

Resolution

Resolved 2026-05-16 (commit pending). The two in-scope parts of the recommendation are done: (1) OpcUaConnectionOptions.AutoAcceptUntrustedCerts now defaults to false; (2) RealOpcUaClient.ConnectAsync logs a prominent ILogger warning whenever the auto-accept certificate validator is installed (an ILogger<RealOpcUaClient> was added as an optional constructor parameter, defaulting to NullLogger, so existing callers are unaffected). Regression test DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse guards the new secure default. Two parts remain outside this module's editable scope and are surfaced as action required: (a) ScadaLink.Commons.Types.DataConnections.OpcUaEndpointConfig.AutoAcceptUntrustedCerts still defaults to true — since that is the value actually used for a real connection (see verification note above), the Commons default must also be flipped to false for the system to be secure-by-default; (b) docs/requirements/Component-DataConnectionLayer.md line 116 still documents true as the default and must be updated. Both edits were deliberately not made here because this task is scoped to src/ScadaLink.DataConnectionLayer, tests, and this findings file only.

DataConnectionLayer-013 — Misleading XML comment: RaiseDisconnected claims thread safety it does not provide

Severity Low
Category Documentation & comments
Status Resolved
Location src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:270-281

Description

The XML doc on RaiseDisconnected states "Thread-safe: only the first caller triggers the event." The implementation is a non-atomic check-then-set on a volatile bool (if (_disconnectFired) return; _disconnectFired = true;). volatile guarantees visibility, not atomicity — two threads (e.g. the OPC UA keep-alive thread via OnClientConnectionLost and a ReadAsync failure path) can both observe _disconnectFired == false and both invoke Disconnected. In practice the DataConnectionActor tolerates a duplicate AdapterDisconnected message, so impact is low, but the comment overstates the guarantee. The same pattern exists in RealOpcUaClient.OnSessionKeepAlive (_connectionLostFired).

Recommendation

Either make the guard atomic (Interlocked.Exchange with an int flag, or a lock), or correct the comment to say "best-effort once-only; a duplicate event is possible under a race and is tolerated downstream."

Resolution

Resolved 2026-05-16 (commit pending). Rather than weaken the XML comment to match the weak guard, the guard was made genuinely atomic so the documented "only the first caller fires the event" guarantee becomes true. OpcUaDataConnection._disconnectFired and RealOpcUaClient._connectionLostFired were changed from volatile bool to int, and the check-then-set in RaiseDisconnected / OnSessionKeepAlive replaced with a single Interlocked.Exchange(ref flag, 1) != 0 compare-and-set; the reset on connect uses Interlocked.Exchange(ref flag, 0). The XML comments on both methods were updated to describe the atomic compare-and-set explicitly. Regression test DCL013_ConcurrentConnectionLost_RaisesDisconnectedExactlyOnce runs 25 rounds, each fanning 32 barrier-synchronised threads that raise the client's ConnectionLost event simultaneously, and asserts Disconnected fires exactly once per round; against a non-atomic check-then-set it double-fires (verified by temporarily reverting the guard), and it passes against the atomic fix.

DataConnectionLayer-014 — DCL-012 security warning is never logged in production: RealOpcUaClient is created without a logger

Severity High
Category Security
Status Resolved
Location src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:325, src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:35-39,79-83

Description

Finding DataConnectionLayer-012 was resolved in part by adding a prominent ILogger warning in RealOpcUaClient.ConnectAsync whenever the auto-accept certificate validator is installed (RealOpcUaClient.cs:79-83). The ILogger<RealOpcUaClient> constructor parameter was made optional, defaulting to NullLogger<RealOpcUaClient>.Instance (RealOpcUaClient.cs:35-39).

However, the only production code path that constructs a RealOpcUaClient is RealOpcUaClientFactory.Create() (RealOpcUaClient.cs:325), which calls new RealOpcUaClient(_globalOptions) and passes no logger. The factory itself holds only an OpcUaGlobalOptions and has no ILoggerFactory/ILogger available. As a result the _logger field is always NullLogger for every real OPC UA connection, and the man-in-the-middle warning the DCL-012 fix added is silently discarded. An operator who deploys a connection with AutoAcceptUntrustedCerts enabled — accepting any server certificate on an industrial control link — gets no visible signal anywhere in the logs. The in-scope half of DCL-012's resolution is therefore not actually effective in production; only the unit test (DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse, which only checks the default value) passes.

Recommendation

Thread a real logger through to RealOpcUaClient. DataConnectionFactory already holds an ILoggerFactory and constructs RealOpcUaClientFactory(globalOptions) — give RealOpcUaClientFactory an ILoggerFactory (or an ILogger<RealOpcUaClient>) constructor parameter and pass _loggerFactory.CreateLogger<RealOpcUaClient>() into each new RealOpcUaClient(...). Add a test that asserts the warning is emitted on a real connect with auto-accept enabled (e.g. via a captured ILogger), not just that the default is false.

Resolution

Resolved 2026-05-17 (commit pending). RealOpcUaClientFactory gained an ILoggerFactory constructor parameter and Create() now passes _loggerFactory.CreateLogger<RealOpcUaClient>() into every RealOpcUaClient it builds; DataConnectionFactory (which already holds an ILoggerFactory) now constructs RealOpcUaClientFactory(globalOptions, _loggerFactory), so the DCL-012 auto-accept-cert warning reaches a real logger in production instead of being discarded by NullLogger. The parameterless / single-arg factory constructors still default to NullLoggerFactory so existing callers are unaffected. Regression tests DCL014_RealOpcUaClientFactory_CreatesClientWithRealLogger and DCL014_DataConnectionFactory_ThreadsLoggerToRealOpcUaClient fail against the pre-fix code (the first fails to compile — no logger ctor; the second observes a NullLogger) and pass after.

DataConnectionLayer-015 — Initial-connect failures never trigger failover; an unreachable primary at startup never tries the backup

Severity Medium
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:404-417, src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:419-493

Description

Failover between the primary and backup endpoints is implemented in two places, both reachable only after a connection has already been Connected at least once: HandleReconnectResult (Reconnecting state) counts _consecutiveFailures and switches endpoint, and BecomeReconnecting counts _consecutiveUnstableDisconnects.

HandleConnectResult — the handler for the initial connection attempt in the Connecting state (DataConnectionActor.cs:404-417) — does neither. On failure it only logs and re-arms the reconnect timer with AttemptConnect; it never increments _consecutiveFailures, never consults _backupConfig, and never switches endpoint.

Consequence: if the primary endpoint is unreachable when the connection actor first starts — which is the common case after a fresh artifact deployment, a site restart, or a primary that is simply down at that moment — the actor retries the primary endpoint indefinitely at ReconnectInterval and never attempts the configured backup. The design doc's endpoint-redundancy promise ("automatic failover when the active endpoint becomes unreachable") is silently not honoured for the never-connected-yet case, and an operator sees a connection stuck Connecting forever despite a healthy backup being configured.

Recommendation

Make HandleConnectResult participate in the failover counter the same way HandleReconnectResult does: increment _consecutiveFailures on failure and, when _backupConfig != null && _consecutiveFailures >= _failoverRetryCount, perform the endpoint switch (dispose adapter, create the other adapter, bump _adapterGeneration, log the failover event) before re-arming the timer. Alternatively, fold the initial connect into the same reconnect path so there is a single failover decision point. Add a regression test for "primary down at startup, backup configured → fails over to backup".

Resolution

Resolved 2026-05-17 (commit pending). The endpoint-switch failover logic was extracted from HandleReconnectResult into a shared CountFailureAndMaybeFailover helper, and HandleConnectResult (the initial-connect handler in the Connecting state) now increments _consecutiveFailures and calls that helper on failure — so a primary that is unreachable at startup fails over to the configured backup after FailoverRetryCount attempts instead of retrying the primary forever. Single-endpoint connections (no backup) still retry indefinitely. Regression test DCL015_PrimaryDownAtStartup_FailsOverToBackup fails against the pre-fix code (the backup adapter is never created) and passes after; DCL015_SingleEndpointDownAtStartup_RetriesIndefinitely_NoFailover guards the no-backup path.

DataConnectionLayer-016 — HandleSubscribeCompleted reports SubscribeTagsResponse success even on a connection-level subscribe failure

Severity Medium
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:606,666-672, src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:232-240

Description

HandleSubscribeCompleted computes connectionLevelFailure (line 606) and returns it so the Connected-state handler can drive the actor into Reconnecting (DataConnectionActor.cs:232-240). But before returning, it unconditionally replies to the caller with new SubscribeTagsResponse(..., true, null, ...) (lines 666-667) — Success: true, Error: null — regardless of whether any tag failed at connection level.

So when a subscribe arrives while the adapter is silently down, the Instance Actor is told the subscribe succeeded, while the connection actor simultaneously transitions to Reconnecting. The tags were never actually subscribed at the adapter (the catch block recorded Success: false); they are recovered later by ReSubscribeAll only if and when reconnection succeeds. The caller has no way to distinguish "subscribed and healthy" from "accepted, but the connection is currently down" — a misleading success signal on a request that did not do what the response claims.

(Genuine tag-resolution failures are arguably also reported as overall true, but that is defensible: those tags are tracked in _unresolvedTags and the design models unresolved tags as a runtime quality concern, with a Bad-quality TagValueUpdate already pushed. The connection-level case is the clear defect because the actor itself treats it as a failure worth a state transition.)

Recommendation

When connectionLevelFailure is true, reply with SubscribeTagsResponse(..., success: false, error: "connection unavailable — will re-subscribe on reconnect", ...) (or an equivalent), so the caller's response matches the actor's own assessment. Optionally carry per-tag outcomes in the response so the Instance Actor can reflect partial success. Add a test asserting the response is not Success: true when a connection-level subscribe failure drives Reconnecting.

Resolution

Resolved 2026-05-17 (commit pending). HandleSubscribeCompleted now replies SubscribeTagsResponse(success: false, error: "connection unavailable — will re-subscribe on reconnect") when connectionLevelFailure is true — matching the actor's own assessment as it drives into Reconnecting — instead of unconditionally replying success: true. Genuine tag-resolution failures still reply success: true (they are a runtime quality concern tracked via _unresolvedTags, with a Bad-quality TagValueUpdate already pushed), so that case is not regressed. Regression test DCL016_ConnectionLevelSubscribeFailure_RepliesWithUnsuccessfulResponse fails against the pre-fix code (it returned Success: true) and passes after; DCL016_GenuineResolutionFailure_StillRepliesSuccess guards the resolution-failure path.

DataConnectionLayer-017 — WriteBatchAsync aborts the whole batch on a mid-batch disconnect

Severity Low
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:229-237, src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:218-227

Description

WriteBatchAsync loops calling WriteAsync per tag (OpcUaDataConnection.cs:229-237). WriteAsync returns a WriteResult for OPC-UA-level write rejections (good — a bad status does not abort the batch), but it first calls EnsureConnected() (OpcUaDataConnection.cs:220), which throws InvalidOperationException when the client is disconnected. WriteBatchAsync does not catch that exception, so if the connection drops partway through a batch the whole WriteBatchAsync throws and the caller gets no result map — losing the per-tag outcomes for the tags that already wrote. This is the same class of defect that DataConnectionLayer-007 fixed for ReadBatchAsync (which now records a failed ReadResult per failing tag and only propagates OperationCanceledException). WriteBatchAsync feeds WriteBatchAndWaitAsync (line 246), so a disconnect during a flag-and-wait write sequence surfaces as an unhandled exception rather than a clean false/per-tag result.

Severity is Low because device writes are real-time control operations with no store-and-forward, the batch write paths are not on the primary HandleWrite hot path (HandleWrite calls single-tag WriteAsync), and a disconnect mid-batch is itself an error condition — but the inconsistent error shape (exception vs. per-tag result) is a maintainability and correctness wart.

Recommendation

Mirror the DCL-007 fix: wrap the per-tag WriteAsync call in WriteBatchAsync in a try/catch that records a failed WriteResult(false, ex.Message) for the failing tag and continues, while still propagating OperationCanceledException to abort a cancelled batch as a whole. This gives callers (including WriteBatchAndWaitAsync) a complete, consistent result map.

Resolution

Resolved 2026-05-17 (commit pending). WriteBatchAsync now wraps each per-tag WriteAsync call in a try/catch — mirroring the DCL-007 fix for ReadBatchAsync — so a mid-batch fault (the InvalidOperationException from EnsureConnected on a dropped connection) is recorded as a failed WriteResult(false, ex.Message) for that tag and the batch returns a complete result map; OperationCanceledException is still propagated so a cancelled batch aborts as a whole. Callers (including WriteBatchAndWaitAsync) now get a consistent per-tag result shape instead of an unhandled exception. Regression test DCL017_WriteBatch_ReturnsPerTagResults_WhenConnectionDropsMidBatch fails against the pre-fix code (the batch throws, no map returned) and passes after; DCL017_WriteBatch_CancellationAbortsWholeBatch guards that cancellation still aborts.