# Code Review — DataConnectionLayer | Field | Value | |-------|-------| | Module | `src/ScadaLink.DataConnectionLayer` | | Design doc | `docs/requirements/Component-DataConnectionLayer.md` | | Status | Reviewed | | Last reviewed | 2026-05-16 | | Reviewer | claude-agent | | Commit reviewed | `9c60592` | | Open findings | 0 | ## Summary The DataConnectionLayer is a reasonably well-structured module: the Become/Stash lifecycle state machine, the captured-`Self` marshalling of background-thread disconnect events, and the protocol-factory abstraction all follow the design doc and Akka.NET conventions. However, the review found one **critical** actor-model violation — `HandleSubscribe` spawns a `Task.Run` that mutates the actor's private dictionaries and counters from a thread-pool thread, racing with the actor's own message loop. Several **high**-severity issues cluster around concurrency and error handling: the subscription-failure path leaves the connection with degraded subtrees but no real recovery, the `DataConnectionManagerActor`'s `Restart` supervision drops all subscription state on a connection-actor crash, and `RealOpcUaClient`'s monitored- item callback dictionary is mutated without synchronization while OPC UA notification threads read it. The remaining findings concern stale health counters after failover, an unused `WriteTimeout` option (writes are unbounded despite the design promising a 30 s timeout), `ReadBatchAsync` aborting mid-batch, and documentation drift between the design doc's failover state machine and the implemented unstable-disconnect heuristic. Test coverage is adequate for the happy paths and failover but absent for tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`. ## Checklist coverage | # | Category | Examined | Notes | |---|----------|----------|-------| | 1 | Correctness & logic bugs | x | `_resolvedTags` double-counting and stale counters after failover; `ReadBatchAsync` aborts mid-batch. | | 2 | Akka.NET conventions | x | `Task.Run` mutating actor state (critical); `Restart` supervision loses state; closures capturing `_subscriptionsByInstance`. | | 3 | Concurrency & thread safety | x | Actor state mutated off the actor thread; `RealOpcUaClient` callback dictionary unsynchronized. | | 4 | Error handling & resilience | x | Subscription failures not surfaced; unbounded write with no timeout; reconnect after subscribe-time failure not handled. | | 5 | Security | x | `AutoAcceptUntrustedCerts` defaults to `true`; OPC UA password handling acceptable. See finding 012. | | 6 | Performance & resource management | x | `HandleUnsubscribe` O(n^2) over instances; initial-read loop serial per tag. | | 7 | Design-document adherence | x | Failover heuristic (unstable-disconnect count) differs from documented state machine; `WriteTimeout` documented but unused. | | 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. | | 9 | Testing coverage | x | No tests for tag-resolution retry, disconnect/re-subscribe, bad-quality push, or `HandleSubscribe` concurrency. | | 10 | Documentation & comments | x | XML comment on `RaiseDisconnected` claims thread safety it does not have; design doc round-robin description stale. | ## Findings ### DataConnectionLayer-001 — `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread | | | |--|--| | Severity | Critical | | Category | Concurrency & thread safety | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:473-538` | **Description** `HandleSubscribe` launches a `Task.Run(async () => ...)` that runs on a thread-pool thread and directly mutates the actor's private mutable state: `instanceTags` (a reference into `_subscriptionsByInstance`), `_subscriptionIds`, `_totalSubscribed`, `_resolvedTags`, and `_unresolvedTags`. All of these are simultaneously read and written by the actor's own message loop (`HandleTagValueReceived`, `HandleUnsubscribe`, `ReSubscribeAll`, `HandleRetryTagResolution`, `ReplyWithHealthReport`). This is a direct violation of the Akka.NET actor model, which guarantees single-threaded access to actor state only when state is touched on the actor thread. Two concurrent subscribe requests, or a subscribe overlapping a `TagValueReceived` / `GetHealthReport`, produce data races on `Dictionary`/`HashSet`/`int` — `Dictionary` is not thread-safe and concurrent mutation can corrupt internal buckets, throw, or lose entries. It can also produce torn reads of the health counters. **Recommendation** Do not mutate actor state from the background task. Perform only the `await _adapter.SubscribeAsync(...)` / `ReadAsync(...)` I/O in the task, collect the results into a local immutable result object, and `PipeTo(Self)` an internal message (e.g. `SubscribeCompleted`) whose handler — running on the actor thread — applies all state mutations and counter updates. The response to `Sender` should be sent from that handler too. **Resolution** Resolved 2026-05-16. `HandleSubscribe` was restructured to follow the actor's own `PipeTo(Self)` pattern (the one already used by `HandleRetryTagResolution`): the background `Task.Run` now performs only adapter I/O (`SubscribeAsync`/`ReadAsync`), collects per-tag outcomes into an immutable `SubscribeCompleted` message, and pipes that to `Self`. All mutation of `_subscriptionIds`, `_subscriptionsByInstance`, `_totalSubscribed`, `_resolvedTags` and `_unresolvedTags` now happens in the new `HandleSubscribeCompleted` handler on the actor thread; it is wired into the Connected, Connecting and Reconnecting states so an in-flight subscribe is applied regardless of state transitions. Regression test `DCL001_ConcurrentSubscribes_DoNotCorruptSubscriptionCounters` (30×30 concurrent subscribes) fails against the pre-fix code and passes after. Fixed by the commit whose message references `DataConnectionLayer-001`. ### DataConnectionLayer-002 — `Restart` supervision discards all subscription state on connection-actor crash | | | |--|--| | Severity | High | | Category | Akka.NET conventions | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionManagerActor.cs:131-141` | **Description** `DataConnectionManagerActor.SupervisorStrategy` returns a `OneForOneStrategy` with `Directive.Restart` for `DataConnectionActor` failures. On restart, Akka.NET creates a fresh actor instance, so all in-memory fields — `_subscriptionsByInstance`, `_subscriptionIds`, `_subscribers`, `_unresolvedTags`, the quality counters — are silently discarded. The actor re-enters `Connecting` with zero subscriptions, and the design doc's "transparent re-subscribe" guarantee (WP-10) is broken: Instance Actors that had subscribed before the crash never get their tags re-subscribed and will sit at uncertain/stale quality indefinitely with no error returned. There is no durable subscription store from which a restarted actor could rebuild state. **Recommendation** Either (a) make the subscription registry durable/recoverable so a restarted actor can rebuild it (persist to local SQLite as the design doc says connection definitions are, and have `PreStart` reload subscriptions), or (b) treat a connection-actor crash as a lifecycle event the `DataConnectionManagerActor` notices, so it can re-issue the subscription registrations. At minimum document that subscribers must re-register after a crash and surface the lost-state condition rather than failing silently. **Resolution** Resolved 2026-05-16. The `DataConnectionManagerActor.SupervisorStrategy` was changed from `Directive.Restart` to `Directive.Resume` for `DataConnectionActor` failures. `Resume` keeps the existing actor instance and all its in-memory subscription state (`_subscriptionsByInstance`, `_subscriptionIds`, `_subscribers`, quality counters) intact across a transient handler exception, so the design doc's "transparent re-subscribe" guarantee (WP-10) is preserved. The actor is a long-lived stateful coordinator and its own Become/Stash reconnect state machine already recovers connection-level faults — it does not need a restart. This also aligns with the ScadaLink convention of `Resume` for coordinator actors. Regression test `DCL002_ConnectionActorCrash_PreservesSubscriptionState` crashes the connection actor via a synchronously-throwing write and asserts the subscription survives (health report still shows 1 subscribed/resolved tag); it fails against the pre-fix `Restart` code and passes after. Fixed by the commit whose message references `DataConnectionLayer-002` (commit ``). ### DataConnectionLayer-003 — `RealOpcUaClient` callback/monitored-item dictionaries mutated without synchronization | | | |--|--| | Severity | High | | Category | Concurrency & thread safety | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:16-17,130-131,153,163,173,183-184` | **Description** `_monitoredItems` and `_callbacks` are plain `Dictionary<,>` instances. They are written from `CreateSubscriptionAsync` / `RemoveSubscriptionAsync` (invoked from the `DataConnectionActor`'s `Task.Run` / `ContinueWith` continuations, i.e. thread-pool threads) and from `DisconnectAsync` (`.Clear()`), while being read concurrently from the OPC Foundation SDK's `MonitoredItem.Notification` event handler, which fires on the SDK's internal publish threads (`_callbacks.TryGetValue(handle, ...)` at line 163). Concurrent reads during a `Dictionary` resize or `Clear()` are undefined behaviour — they can throw `InvalidOperationException`, return wrong entries, or corrupt the dictionary. The `DataConnectionActor`'s subscribe path already runs off the actor thread (finding 001), so multiple subscribe calls can also race each other here. **Recommendation** Use `ConcurrentDictionary<,>` for `_monitoredItems` and `_callbacks`, or guard all access with a lock. Note that fixing finding 001 (serialising subscribe through the actor thread) reduces but does not eliminate the race, because the SDK notification threads still read `_callbacks` concurrently with `RemoveSubscriptionAsync` / `DisconnectAsync`. **Resolution** Resolved 2026-05-16. `_monitoredItems` and `_callbacks` in `RealOpcUaClient` were changed from plain `Dictionary<,>` to `ConcurrentDictionary<,>`, and the two `Remove(key)` call sites switched to `TryRemove`. This makes the maps safe to read from the OPC Foundation SDK's publish threads (`MonitoredItem.Notification` reading `_callbacks`) concurrently with subscribe/disconnect mutations on other threads. `RealOpcUaClient` wraps concrete OPC Foundation SDK types (`ISession`, `Subscription`, `MonitoredItem`) and cannot be exercised without a live OPC UA server, so the regression is guarded structurally by `DCL003_SharedDictionaryFields_AreConcurrentCollections` (a reflection test asserting both fields are `ConcurrentDictionary<,>`); it fails against the pre-fix `Dictionary` code and passes after. Fixed by the commit whose message references `DataConnectionLayer-003` (commit ``). ### DataConnectionLayer-004 — Subscribe-time tag-resolution failure leaves the connection healthy but never recovers correctly | | | |--|--| | Severity | High | | Category | Error handling & resilience | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:495-503,529-537` | **Description** When `_adapter.SubscribeAsync` throws inside the `HandleSubscribe` background task, the catch block adds the tag to `_unresolvedTags` and increments `_totalSubscribed`, treating every subscribe exception as a tag-resolution failure. But `SubscribeAsync` also throws `InvalidOperationException` from `EnsureConnected()` when the OPC UA client is not connected, and throws on transport faults — these are connection problems, not bad tag paths. They get misclassified as unresolved tags and retried on the 10 s tag-resolution timer instead of triggering the reconnection state machine. Worse, the design doc (Tag Path Resolution, step 2) says the failed tag's attribute must be marked quality `bad`; the code never pushes a bad-quality update to the subscriber for a tag that fails to resolve at subscribe time, so the Instance Actor stays at uncertain quality with no signal. The `TagResolutionFailed` message it sends to `Self` only logs and re-arms the timer (`HandleTagResolutionFailed`). **Recommendation** Distinguish connection-level exceptions (raise `AdapterDisconnected` / let the reconnect machine handle them) from genuine node-not-found errors. For genuine resolution failures, push a `TagValueUpdate` with `QualityCode.Bad` to the subscribing Instance Actor so it reflects the documented behaviour. **Resolution** Resolved 2026-05-16. The subscribe background task now classifies each subscribe exception via the new `IsConnectionLevelFailure` helper (`InvalidOperationException` — thrown by `EnsureConnected()` — plus `SocketException`/`TimeoutException`/ `IOException` count as connection-level; anything else is a genuine resolution failure). The classification is carried on `SubscribeTagResult.ConnectionLevelFailure` and applied on the actor thread in `HandleSubscribeCompleted`: connection-level failures no longer become unresolved tags and instead drive the reconnection state machine (`HandleSubscribeCompleted` returns a flag and the Connected-state handler calls `BecomeReconnecting`); genuine resolution failures still go to `_unresolvedTags` and the retry timer, and now also push a `TagValueUpdate` with `QualityCode.Bad` to the subscribing Instance Actor, matching the design doc's Tag Path Resolution step 2. Regression tests `DCL004_GenuineTagResolutionFailure_PushesBadQualityToSubscriber` and `DCL004_ConnectionLevelSubscribeFailure_TriggersReconnect_NotTagRetry` both fail against the pre-fix code and pass after. Fixed by the commit whose message references `DataConnectionLayer-004` (commit ``). ### DataConnectionLayer-005 — `WriteTimeout` option is documented and configured but never applied | | | |--|--| | Severity | High | | Category | Design-document adherence | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/DataConnectionOptions.cs:15`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:573-590` | **Description** `DataConnectionOptions.WriteTimeout` (default 30 s) and the design doc's "Shared Settings" table both promise a bounded timeout for synchronous device writes. The value is never read anywhere in the module (`grep` confirms only the declaration). `HandleWrite` calls `_adapter.WriteAsync(request.TagPath, request.Value)` with no `CancellationToken` and no timeout. If the OPC UA server hangs (TCP black-hole, no RST), the write `Task` never completes, `PipeTo(sender)` never fires, and the calling script's Ask blocks until its own ask-timeout — and the script gets no DCL-level error. The design states write failures (including timeout) must be returned synchronously to the script; an unbounded write violates that. **Recommendation** Create a `CancellationTokenSource(_options.WriteTimeout)`, pass its token to `WriteAsync`, and in the continuation translate cancellation into a failed `WriteTagResponse` with a timeout error message. Apply the same to the read used by the initial-value seed and to `WriteBatchAndWaitAsync` paths if they are reachable. **Resolution** Resolved 2026-05-16. `HandleWrite` now creates a `CancellationTokenSource(_options.WriteTimeout)`, passes its token to `_adapter.WriteAsync(...)`, and disposes the source in the continuation. A cancelled/timed-out write (`Task.IsCanceled` or a base `OperationCanceledException`) is translated into a failed `WriteTagResponse` with a `"Write timeout after Ns"` message, so a hung device write is bounded and the failure is returned synchronously to the calling script (WP-11) instead of blocking until the script's own Ask-timeout. (The `WriteBatchAndWaitAsync` adapter path already accepts an explicit `timeout`/`CancellationToken` and is not invoked by `HandleWrite`, so no change was needed there.) Regression test `DCL005_Write_ThatHangs_TimesOutAndReturnsFailureSynchronously` uses an adapter whose `WriteAsync` only completes when its token fires; it fails against the pre-fix unbounded code and passes after. Fixed by the commit whose message references `DataConnectionLayer-005` (commit ``). ### DataConnectionLayer-006 — Health quality counters not reset/recomputed after failover or re-subscribe | | | |--|--| | Severity | Medium | | Category | Correctness & logic bugs | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:645-673,721-756` | **Description** `ReSubscribeAll` resets `_subscriptionIds`, `_unresolvedTags` and `_resolvedTags` to a clean slate, but leaves `_lastTagQuality`, `_tagsGoodQuality`, `_tagsBadQuality` and `_tagsUncertainQuality` untouched. `PushBadQualityForAllTags` (called on disconnect) sets `_tagsBadQuality = _lastTagQuality.Count` and zeroes the others. After a reconnect, `HandleTagValueReceived` decrements the *old* bucket using `_lastTagQuality`'s value and increments the new one — but tags resolved for the first time after reconnect were never in `_lastTagQuality`, so they only increment, never decrement, and the totals can drift above `_totalSubscribed`. Over repeated disconnect/reconnect cycles the health report's good/bad/uncertain counts become unreliable. **Verification note**: Confirmed against source. The root cause is broader than the reconnect path the finding describes: `HandleUnsubscribe` also never removes a tag from `_lastTagQuality` nor decrements its quality bucket, so an unsubscribed tag lingers and `PushBadQualityForAllTags` (which sets `_tagsBadQuality = _lastTagQuality.Count`) over-counts it — driving the bad-quality count above `_totalSubscribed` even without a re-subscribe. Both the unsubscribe leak and the re-subscribe drift are real. **Recommendation** On `BecomeConnected` after a re-subscribe (or in `ReSubscribeAll`), clear `_lastTagQuality` and the three quality counters and let them be repopulated from fresh `TagValueReceived` messages. Alternatively recompute the buckets from `_lastTagQuality` whenever it changes rather than maintaining incremental counters. **Resolution** Resolved 2026-05-16 (commit pending). `HandleUnsubscribe` now removes each unsubscribed tag from `_lastTagQuality` and decrements the corresponding quality bucket, then reports the corrected counters via `UpdateTagQuality`/`UpdateTagResolution`; `ReSubscribeAll` clears `_lastTagQuality` and zeroes the three quality counters so post-reconnect tags are repopulated from fresh `TagValueReceived` messages instead of only incrementing. Regression test `DCL006_DisconnectAfterUnsubscribe_BadQualityCountMatchesRemainingTags` subscribes two tags, pushes Good values, unsubscribes one, then disconnects and asserts `PushBadQualityForAllTags` reports exactly 1 bad tag (the reconnect is gated open so `ReSubscribeAll` does not run before the assertion); it reports 2 against the pre-fix code and 1 after. ### DataConnectionLayer-007 — `ReadBatchAsync` aborts the whole batch on the first failing tag | | | |--|--| | Severity | Medium | | Category | Correctness & logic bugs | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:187-195` | **Description** `ReadBatchAsync` loops calling `ReadAsync` per tag. `ReadAsync` re-throws any non-cancellation exception (line 184). So if any single tag in the batch throws (bad node, transient fault), the entire `ReadBatchAsync` throws and the caller gets no results for the tags that *did* read successfully — even though `ReadResult` already has a `Success`/`ErrorMessage` shape designed to carry per-tag failures. The batch is also fully serial (one round-trip per tag), defeating the point of a batch API; the design doc lists `ReadBatch`/`WriteBatch` as first-class operations. **Verification note**: Confirmed against source — `ReadAsync` re-throws on any non-`OperationCanceledException`, aborting the whole batch. **Recommendation** Catch per-tag exceptions inside the loop and store a failed `ReadResult` for that tag so the batch returns a complete map. Ideally issue a single OPC UA `Read` service call for all node IDs (`RealOpcUaClient.ReadValueAsync` already builds a `ReadValueIdCollection` — extend it to accept multiple nodes). **Resolution** Resolved 2026-05-16 (commit pending). `ReadBatchAsync` now wraps each per-tag `ReadAsync` in a try/catch: a per-tag exception is recorded as a failed `ReadResult` (`Success: false`, message = the exception message) so the batch returns a complete result map for every requested tag; `OperationCanceledException` is still propagated so a cancelled batch aborts as a whole. The per-tag-serial loop and single-service-call optimisation were deliberately left for a follow-up — they are a performance concern, not the correctness bug this finding raised. Regression test `DCL007_ReadBatch_ReturnsPerTagResults_WhenOneTagFails` reads three tags where the middle one throws and asserts all three appear in the result map with the failing one marked unsuccessful; it threw (no map returned) against the pre-fix code and passes after. ### DataConnectionLayer-008 — `HandleUnsubscribe` is O(n^2) over instances and rechecks `_unresolvedTags` redundantly | | | |--|--| | Severity | Low | | Category | Performance & resource management | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:540-569` | **Description** For each tag of the instance being removed, `HandleUnsubscribe` scans every other instance's tag set (`_subscriptionsByInstance.Where(...).Any()`), making the operation O(tags x instances). On a site with many instances sharing a connection this is needlessly expensive on every instance stop/redeploy. Separately, line 562 re-evaluates `!_unresolvedTags.Contains(tagPath)` immediately after line 561 already removed `tagPath` from `_unresolvedTags`, so the condition is always true — dead logic that obscures intent (the decrement of `_resolvedTags` is unconditional in practice). **Recommendation** Maintain a reference count per tag path (or a `tagPath -> set` reverse index) so the "any other subscriber" check is O(1). Remove the redundant `_unresolvedTags` re-check or restructure so the resolved/unresolved decrement reflects the tag's actual prior state captured before removal. **Resolution** Resolved 2026-05-16 (commit pending). A `tagPath → subscriber-count` reverse index (`_tagSubscriberCount`) was added: `HandleSubscribeCompleted` increments it whenever a tag is newly added to an instance's set, and `HandleUnsubscribe` decrements it, releasing a tag at the adapter only when the count reaches zero. The "any other subscriber" check is now O(1) per tag instead of an O(instances) `Where(...).Any()` scan. The redundant `!_unresolvedTags.Contains(tagPath)` re-check (always true after the unconditional `Remove` on the line above) was removed — the surviving branch is entered only for tags that have a subscription id, which are by definition resolved, so `_resolvedTags--` is now unconditional with an explanatory comment. The cleanup also fixed a latent leak the original code could not reach: an unresolved tag whose last subscriber unsubscribes is now removed from `_unresolvedTags`/`_resolutionInFlight` and decremented from `_totalSubscribed` (previously it lingered in the retry timer and the subscribed total forever). Regression test `DCL008_Unsubscribe_OnlyReleasesTagWhenLastSubscriberLeaves` subscribes a tag from two instances plus an exclusive tag, then unsubscribes each instance and asserts the shared tag is released at the adapter only after the last subscriber leaves and the health counters track correctly. (This finding is a performance refactor, not a correctness bug — the pre-fix `Where(...).Any()` logic was functionally correct, so the test passes against both versions and serves as a behavioural guard for the refactor.) ### DataConnectionLayer-009 — Implemented failover heuristic diverges from the documented state machine | | | |--|--| | Severity | Medium — partially design-doc work outside this module's editable scope | | Category | Design-document adherence | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:189,242-297,379-449`, `docs/requirements/Component-DataConnectionLayer.md:73-85` | **Description** The design doc's failover state machine reads "retry active endpoint (5s) -> N failures (>= FailoverRetryCount) -> switch to other endpoint". The code implements two *separate* failover triggers: (a) `HandleReconnectResult` counts `_consecutiveFailures` on connect-attempt failures (matches the doc), and (b) `BecomeReconnecting` additionally counts `_consecutiveUnstableDisconnects` — connections that succeeded but dropped within a hard-coded 60 s `StableConnectionThreshold` — and fails over on that count too. The unstable-disconnect path, the 60 s threshold, and the fact that failover can happen on *successful-but-flaky* connections are not described in the component doc at all. A reviewer or operator reading `Component-DataConnectionLayer.md` would not predict this behaviour, and the 60 s threshold is a magic constant not exposed via `DataConnectionOptions`. **Verification note**: Confirmed against source. The hard-coded `StableConnectionThreshold = TimeSpan.FromSeconds(60)` `static readonly` field and the `_consecutiveUnstableDisconnects` failover path both exist as described. **Recommendation** Update `Component-DataConnectionLayer.md` to document the unstable-disconnect failover path and the stability threshold, and move the 60 s threshold into `DataConnectionOptions` so it is configurable and consistent with the other tunables. **Resolution** Resolved 2026-05-16 (commit pending). The configurability half of the recommendation is done: the hard-coded `StableConnectionThreshold` constant was removed from `DataConnectionActor` and replaced with a new `DataConnectionOptions.StableConnectionThreshold` property (60 s default), bindable from the `DataConnectionLayer` `appsettings.json` section like `ReconnectInterval`/`TagResolutionRetryInterval`/`WriteTimeout`. Regression test `DCL009_StableConnectionThreshold_IsConfigurable_WithSixtySecondDefault` guards the default and the setter. **The documentation half is out of this module's editable scope** — `docs/requirements/Component-DataConnectionLayer.md` (lines 73-85) still describes only the connect-failure failover path and does not mention the unstable-disconnect trigger. **Action required (surfaced):** the DCL design doc should be updated to document the unstable-disconnect failover path and the configurable stability threshold; that edit was deliberately not made here because this task is scoped to `src/ScadaLink.DataConnectionLayer`, tests, and this findings file only. ### DataConnectionLayer-010 — Tag-resolution retry can issue duplicate concurrent subscribe attempts | | | |--|--| | Severity | Medium | | Category | Correctness & logic bugs | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:594-619,689-703` | **Description** `HandleRetryTagResolution` fires `SubscribeAsync` for every tag in `_unresolvedTags` via `ContinueWith(...).PipeTo(self)`, but does **not** remove the tags from `_unresolvedTags` while the attempts are in flight. Because tags are not removed before the retry, a slow `SubscribeAsync` overlapping the next 10 s tick issues duplicate concurrent subscribe attempts for the same tag, which can create duplicate monitored items / leaked subscription IDs (the second success overwrites `_subscriptionIds[tag]` in `HandleTagResolutionSucceeded`, orphaning the first handle with no `UnsubscribeAsync` call). The timer-cancel condition in `HandleTagResolutionSucceeded` is also non-deterministic for the same reason. **Verification note**: Confirmed against source — `HandleRetryTagResolution` dispatched `SubscribeAsync` for every tag in `_unresolvedTags` on every tick with no in-flight guard. **Recommendation** Remove tags from `_unresolvedTags` (into an "in-flight" set) when a retry is dispatched, and only put them back on failure. This prevents overlapping duplicate subscribe attempts and makes the timer-cancel condition deterministic. **Resolution** Resolved 2026-05-16 (commit pending). A new `_resolutionInFlight` `HashSet` tracks tags whose retry `SubscribeAsync` is currently outstanding. `HandleRetryTagResolution` now dispatches only for unresolved tags **not** already in flight (and skips entirely if all are in flight), adding each dispatched tag to the set; `HandleTagResolutionSucceeded` and `HandleTagResolutionFailed` remove the tag from the set when its attempt completes, and `HandleUnsubscribe`/`ReSubscribeAll` clear stale entries. This prevents overlapping duplicate subscribe attempts and the resulting orphaned monitored items. Regression test `DCL010_TagResolutionRetry_DoesNotIssueDuplicateConcurrentSubscribes` gives a tag a genuine initial failure then a retry `SubscribeAsync` that never completes, lets six 100 ms retry ticks elapse, and asserts exactly one retry was dispatched (2 total subscribe calls); the pre-fix code dispatched on every tick (6 total). ### DataConnectionLayer-011 — Stale subscription callbacks from disposed adapters can still reach the actor | | | |--|--| | Severity | Medium | | Category | Error handling & resilience | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:486-489,278-285,416-425`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:252-262` | **Description** On failover the actor disposes the old adapter (`_adapter.DisposeAsync()`, fire-and-forget) and creates a fresh one. The old adapter's subscription callbacks captured `self` and `tagPath` and `Tell` `TagValueReceived` to the actor. While the `Reconnecting` handler ignores `TagValueReceived` (line 334), once the actor reaches `Connected` again it processes them — and a disposed adapter whose OPC UA SDK threads have not yet fully torn down could still deliver a value, mixing pre-failover device data with the new endpoint's data and briefly reporting a value the active endpoint never produced. There is no per-adapter generation/epoch tag on `TagValueReceived` to distinguish current from stale callbacks. **Verification note**: Confirmed against source — `TagValueReceived` carried no adapter identity, and `HandleTagValueReceived` (reachable in `Connected`) processed any such message regardless of which adapter produced it. **Recommendation** Add an adapter-generation counter incremented on every adapter swap; stamp it onto `TagValueReceived` (captured in the callback closure) and drop messages whose generation does not match the current adapter in `HandleTagValueReceived`. **Resolution** Resolved 2026-05-16 (commit pending). Implemented exactly as recommended: a new `_adapterGeneration` `int` field is incremented at both adapter-swap sites (the unstable-disconnect failover in `BecomeReconnecting` and the connect-failure failover in `HandleReconnectResult`). The `TagValueReceived` record gained an `AdapterGeneration` field; every subscription callback closure (`HandleSubscribe`, the initial-read seed, `HandleRetryTagResolution`, `ReSubscribeAll`) captures the generation in effect at subscribe time and stamps it onto each `TagValueReceived`. `HandleTagValueReceived` drops any message whose generation no longer matches the current adapter, so a callback fired by a disposed adapter after failover cannot reach an Instance Actor. Regression test `DCL011_StaleTagValueFromOldAdapter_IsNotForwardedAfterFailover` subscribes on the primary, fails over to the backup, then invokes the captured primary callback with a stale value and asserts the subscriber receives nothing; the stale value reached the subscriber against the pre-fix code and is dropped after. ### DataConnectionLayer-012 — `AutoAcceptUntrustedCerts` defaults to `true`, accepting any server certificate | | | |--|--| | Severity | Medium — full secure default also requires a Commons + design-doc change outside this module | | Category | Security | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Adapters/IOpcUaClient.cs:17`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:49,60-61`, `docs/requirements/Component-DataConnectionLayer.md:116` | **Description** `OpcUaConnectionOptions.AutoAcceptUntrustedCerts` defaults to `true`, and `RealOpcUaClient.ConnectAsync` wires `CertificateValidator.CertificateValidation += (_, e) => e.Accept = true` when it is set. With the default, every server certificate is accepted unconditionally — there is no certificate-pinning or trust-store enforcement — which defeats the `Sign`/`SignAndEncrypt` security modes against an active man-in-the-middle on the OPC UA link. The design doc explicitly lists `true` as the default. For an industrial control link this is a meaningful exposure; a secure-by-default posture would reject untrusted certs unless an operator opts in per connection. **Verification note**: Confirmed against source. Note the *authoritative* runtime default does not actually live on `OpcUaConnectionOptions` — for a real connection `OpcUaDataConnection.ConnectAsync` builds `OpcUaConnectionOptions` from `OpcUaEndpointConfig` (in `ScadaLink.Commons`), whose `AutoAcceptUntrustedCerts` property also defaults to `true`. `OpcUaConnectionOptions`' own default is only the fallback used when an `OpcUaConnectionOptions` is constructed directly. **Recommendation** Default `AutoAcceptUntrustedCerts` to `false` and require explicit per-connection opt-in, or at minimum log a prominent warning whenever the auto-accept validator is installed. Update the design doc to reflect the secure default. **Resolution** Resolved 2026-05-16 (commit pending). The two in-scope parts of the recommendation are done: (1) `OpcUaConnectionOptions.AutoAcceptUntrustedCerts` now defaults to `false`; (2) `RealOpcUaClient.ConnectAsync` logs a prominent `ILogger` warning whenever the auto-accept certificate validator is installed (an `ILogger` was added as an optional constructor parameter, defaulting to `NullLogger`, so existing callers are unaffected). Regression test `DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse` guards the new secure default. **Two parts remain outside this module's editable scope and are surfaced as action required:** (a) `ScadaLink.Commons.Types.DataConnections.OpcUaEndpointConfig.AutoAcceptUntrustedCerts` still defaults to `true` — since that is the value actually used for a real connection (see verification note above), the Commons default must also be flipped to `false` for the system to be secure-by-default; (b) `docs/requirements/Component-DataConnectionLayer.md` line 116 still documents `true` as the default and must be updated. Both edits were deliberately not made here because this task is scoped to `src/ScadaLink.DataConnectionLayer`, tests, and this findings file only. ### DataConnectionLayer-013 — Misleading XML comment: `RaiseDisconnected` claims thread safety it does not provide | | | |--|--| | Severity | Low | | Category | Documentation & comments | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:270-281` | **Description** The XML doc on `RaiseDisconnected` states "Thread-safe: only the first caller triggers the event." The implementation is a non-atomic check-then-set on a `volatile bool` (`if (_disconnectFired) return; _disconnectFired = true;`). `volatile` guarantees visibility, not atomicity — two threads (e.g. the OPC UA keep-alive thread via `OnClientConnectionLost` and a `ReadAsync` failure path) can both observe `_disconnectFired == false` and both invoke `Disconnected`. In practice the `DataConnectionActor` tolerates a duplicate `AdapterDisconnected` message, so impact is low, but the comment overstates the guarantee. The same pattern exists in `RealOpcUaClient.OnSessionKeepAlive` (`_connectionLostFired`). **Recommendation** Either make the guard atomic (`Interlocked.Exchange` with an `int` flag, or a lock), or correct the comment to say "best-effort once-only; a duplicate event is possible under a race and is tolerated downstream." **Resolution** Resolved 2026-05-16 (commit pending). Rather than weaken the XML comment to match the weak guard, the guard was made genuinely atomic so the documented "only the first caller fires the event" guarantee becomes true. `OpcUaDataConnection._disconnectFired` and `RealOpcUaClient._connectionLostFired` were changed from `volatile bool` to `int`, and the check-then-set in `RaiseDisconnected` / `OnSessionKeepAlive` replaced with a single `Interlocked.Exchange(ref flag, 1) != 0` compare-and-set; the reset on connect uses `Interlocked.Exchange(ref flag, 0)`. The XML comments on both methods were updated to describe the atomic compare-and-set explicitly. Regression test `DCL013_ConcurrentConnectionLost_RaisesDisconnectedExactlyOnce` runs 25 rounds, each fanning 32 barrier-synchronised threads that raise the client's `ConnectionLost` event simultaneously, and asserts `Disconnected` fires exactly once per round; against a non-atomic check-then-set it double-fires (verified by temporarily reverting the guard), and it passes against the atomic fix.