# Code Review — DataConnectionLayer | Field | Value | |-------|-------| | Module | `src/ScadaLink.DataConnectionLayer` | | Design doc | `docs/requirements/Component-DataConnectionLayer.md` | | Status | Reviewed | | Last reviewed | 2026-05-16 | | Reviewer | claude-agent | | Commit reviewed | `9c60592` | | Open findings | 2 | ## Summary The DataConnectionLayer is a reasonably well-structured module: the Become/Stash lifecycle state machine, the captured-`Self` marshalling of background-thread disconnect events, and the protocol-factory abstraction all follow the design doc and Akka.NET conventions. However, the review found one **critical** actor-model violation — `HandleSubscribe` spawns a `Task.Run` that mutates the actor's private dictionaries and counters from a thread-pool thread, racing with the actor's own message loop. Several **high**-severity issues cluster around concurrency and error handling: the subscription-failure path leaves the connection with degraded subtrees but no real recovery, the `DataConnectionManagerActor`'s `Restart` supervision drops all subscription state on a connection-actor crash, and `RealOpcUaClient`'s monitored- item callback dictionary is mutated without synchronization while OPC UA notification threads read it. The remaining findings concern stale health counters after failover, an unused `WriteTimeout` option (writes are unbounded despite the design promising a 30 s timeout), `ReadBatchAsync` aborting mid-batch, and documentation drift between the design doc's failover state machine and the implemented unstable-disconnect heuristic. Test coverage is adequate for the happy paths and failover but absent for tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`. ## Checklist coverage | # | Category | Examined | Notes | |---|----------|----------|-------| | 1 | Correctness & logic bugs | x | `_resolvedTags` double-counting and stale counters after failover; `ReadBatchAsync` aborts mid-batch. | | 2 | Akka.NET conventions | x | `Task.Run` mutating actor state (critical); `Restart` supervision loses state; closures capturing `_subscriptionsByInstance`. | | 3 | Concurrency & thread safety | x | Actor state mutated off the actor thread; `RealOpcUaClient` callback dictionary unsynchronized. | | 4 | Error handling & resilience | x | Subscription failures not surfaced; unbounded write with no timeout; reconnect after subscribe-time failure not handled. | | 5 | Security | x | `AutoAcceptUntrustedCerts` defaults to `true`; OPC UA password handling acceptable. See finding 012. | | 6 | Performance & resource management | x | `HandleUnsubscribe` O(n^2) over instances; initial-read loop serial per tag. | | 7 | Design-document adherence | x | Failover heuristic (unstable-disconnect count) differs from documented state machine; `WriteTimeout` documented but unused. | | 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. | | 9 | Testing coverage | x | No tests for tag-resolution retry, disconnect/re-subscribe, bad-quality push, or `HandleSubscribe` concurrency. | | 10 | Documentation & comments | x | XML comment on `RaiseDisconnected` claims thread safety it does not have; design doc round-robin description stale. | ## Findings ### DataConnectionLayer-001 — `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread | | | |--|--| | Severity | Critical | | Category | Concurrency & thread safety | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:473-538` | **Description** `HandleSubscribe` launches a `Task.Run(async () => ...)` that runs on a thread-pool thread and directly mutates the actor's private mutable state: `instanceTags` (a reference into `_subscriptionsByInstance`), `_subscriptionIds`, `_totalSubscribed`, `_resolvedTags`, and `_unresolvedTags`. All of these are simultaneously read and written by the actor's own message loop (`HandleTagValueReceived`, `HandleUnsubscribe`, `ReSubscribeAll`, `HandleRetryTagResolution`, `ReplyWithHealthReport`). This is a direct violation of the Akka.NET actor model, which guarantees single-threaded access to actor state only when state is touched on the actor thread. Two concurrent subscribe requests, or a subscribe overlapping a `TagValueReceived` / `GetHealthReport`, produce data races on `Dictionary`/`HashSet`/`int` — `Dictionary` is not thread-safe and concurrent mutation can corrupt internal buckets, throw, or lose entries. It can also produce torn reads of the health counters. **Recommendation** Do not mutate actor state from the background task. Perform only the `await _adapter.SubscribeAsync(...)` / `ReadAsync(...)` I/O in the task, collect the results into a local immutable result object, and `PipeTo(Self)` an internal message (e.g. `SubscribeCompleted`) whose handler — running on the actor thread — applies all state mutations and counter updates. The response to `Sender` should be sent from that handler too. **Resolution** Resolved 2026-05-16. `HandleSubscribe` was restructured to follow the actor's own `PipeTo(Self)` pattern (the one already used by `HandleRetryTagResolution`): the background `Task.Run` now performs only adapter I/O (`SubscribeAsync`/`ReadAsync`), collects per-tag outcomes into an immutable `SubscribeCompleted` message, and pipes that to `Self`. All mutation of `_subscriptionIds`, `_subscriptionsByInstance`, `_totalSubscribed`, `_resolvedTags` and `_unresolvedTags` now happens in the new `HandleSubscribeCompleted` handler on the actor thread; it is wired into the Connected, Connecting and Reconnecting states so an in-flight subscribe is applied regardless of state transitions. Regression test `DCL001_ConcurrentSubscribes_DoNotCorruptSubscriptionCounters` (30×30 concurrent subscribes) fails against the pre-fix code and passes after. Fixed by the commit whose message references `DataConnectionLayer-001`. ### DataConnectionLayer-002 — `Restart` supervision discards all subscription state on connection-actor crash | | | |--|--| | Severity | High | | Category | Akka.NET conventions | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionManagerActor.cs:131-141` | **Description** `DataConnectionManagerActor.SupervisorStrategy` returns a `OneForOneStrategy` with `Directive.Restart` for `DataConnectionActor` failures. On restart, Akka.NET creates a fresh actor instance, so all in-memory fields — `_subscriptionsByInstance`, `_subscriptionIds`, `_subscribers`, `_unresolvedTags`, the quality counters — are silently discarded. The actor re-enters `Connecting` with zero subscriptions, and the design doc's "transparent re-subscribe" guarantee (WP-10) is broken: Instance Actors that had subscribed before the crash never get their tags re-subscribed and will sit at uncertain/stale quality indefinitely with no error returned. There is no durable subscription store from which a restarted actor could rebuild state. **Recommendation** Either (a) make the subscription registry durable/recoverable so a restarted actor can rebuild it (persist to local SQLite as the design doc says connection definitions are, and have `PreStart` reload subscriptions), or (b) treat a connection-actor crash as a lifecycle event the `DataConnectionManagerActor` notices, so it can re-issue the subscription registrations. At minimum document that subscribers must re-register after a crash and surface the lost-state condition rather than failing silently. **Resolution** Resolved 2026-05-16. The `DataConnectionManagerActor.SupervisorStrategy` was changed from `Directive.Restart` to `Directive.Resume` for `DataConnectionActor` failures. `Resume` keeps the existing actor instance and all its in-memory subscription state (`_subscriptionsByInstance`, `_subscriptionIds`, `_subscribers`, quality counters) intact across a transient handler exception, so the design doc's "transparent re-subscribe" guarantee (WP-10) is preserved. The actor is a long-lived stateful coordinator and its own Become/Stash reconnect state machine already recovers connection-level faults — it does not need a restart. This also aligns with the ScadaLink convention of `Resume` for coordinator actors. Regression test `DCL002_ConnectionActorCrash_PreservesSubscriptionState` crashes the connection actor via a synchronously-throwing write and asserts the subscription survives (health report still shows 1 subscribed/resolved tag); it fails against the pre-fix `Restart` code and passes after. Fixed by the commit whose message references `DataConnectionLayer-002` (commit ``). ### DataConnectionLayer-003 — `RealOpcUaClient` callback/monitored-item dictionaries mutated without synchronization | | | |--|--| | Severity | High | | Category | Concurrency & thread safety | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:16-17,130-131,153,163,173,183-184` | **Description** `_monitoredItems` and `_callbacks` are plain `Dictionary<,>` instances. They are written from `CreateSubscriptionAsync` / `RemoveSubscriptionAsync` (invoked from the `DataConnectionActor`'s `Task.Run` / `ContinueWith` continuations, i.e. thread-pool threads) and from `DisconnectAsync` (`.Clear()`), while being read concurrently from the OPC Foundation SDK's `MonitoredItem.Notification` event handler, which fires on the SDK's internal publish threads (`_callbacks.TryGetValue(handle, ...)` at line 163). Concurrent reads during a `Dictionary` resize or `Clear()` are undefined behaviour — they can throw `InvalidOperationException`, return wrong entries, or corrupt the dictionary. The `DataConnectionActor`'s subscribe path already runs off the actor thread (finding 001), so multiple subscribe calls can also race each other here. **Recommendation** Use `ConcurrentDictionary<,>` for `_monitoredItems` and `_callbacks`, or guard all access with a lock. Note that fixing finding 001 (serialising subscribe through the actor thread) reduces but does not eliminate the race, because the SDK notification threads still read `_callbacks` concurrently with `RemoveSubscriptionAsync` / `DisconnectAsync`. **Resolution** Resolved 2026-05-16. `_monitoredItems` and `_callbacks` in `RealOpcUaClient` were changed from plain `Dictionary<,>` to `ConcurrentDictionary<,>`, and the two `Remove(key)` call sites switched to `TryRemove`. This makes the maps safe to read from the OPC Foundation SDK's publish threads (`MonitoredItem.Notification` reading `_callbacks`) concurrently with subscribe/disconnect mutations on other threads. `RealOpcUaClient` wraps concrete OPC Foundation SDK types (`ISession`, `Subscription`, `MonitoredItem`) and cannot be exercised without a live OPC UA server, so the regression is guarded structurally by `DCL003_SharedDictionaryFields_AreConcurrentCollections` (a reflection test asserting both fields are `ConcurrentDictionary<,>`); it fails against the pre-fix `Dictionary` code and passes after. Fixed by the commit whose message references `DataConnectionLayer-003` (commit ``). ### DataConnectionLayer-004 — Subscribe-time tag-resolution failure leaves the connection healthy but never recovers correctly | | | |--|--| | Severity | High | | Category | Error handling & resilience | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:495-503,529-537` | **Description** When `_adapter.SubscribeAsync` throws inside the `HandleSubscribe` background task, the catch block adds the tag to `_unresolvedTags` and increments `_totalSubscribed`, treating every subscribe exception as a tag-resolution failure. But `SubscribeAsync` also throws `InvalidOperationException` from `EnsureConnected()` when the OPC UA client is not connected, and throws on transport faults — these are connection problems, not bad tag paths. They get misclassified as unresolved tags and retried on the 10 s tag-resolution timer instead of triggering the reconnection state machine. Worse, the design doc (Tag Path Resolution, step 2) says the failed tag's attribute must be marked quality `bad`; the code never pushes a bad-quality update to the subscriber for a tag that fails to resolve at subscribe time, so the Instance Actor stays at uncertain quality with no signal. The `TagResolutionFailed` message it sends to `Self` only logs and re-arms the timer (`HandleTagResolutionFailed`). **Recommendation** Distinguish connection-level exceptions (raise `AdapterDisconnected` / let the reconnect machine handle them) from genuine node-not-found errors. For genuine resolution failures, push a `TagValueUpdate` with `QualityCode.Bad` to the subscribing Instance Actor so it reflects the documented behaviour. **Resolution** Resolved 2026-05-16. The subscribe background task now classifies each subscribe exception via the new `IsConnectionLevelFailure` helper (`InvalidOperationException` — thrown by `EnsureConnected()` — plus `SocketException`/`TimeoutException`/ `IOException` count as connection-level; anything else is a genuine resolution failure). The classification is carried on `SubscribeTagResult.ConnectionLevelFailure` and applied on the actor thread in `HandleSubscribeCompleted`: connection-level failures no longer become unresolved tags and instead drive the reconnection state machine (`HandleSubscribeCompleted` returns a flag and the Connected-state handler calls `BecomeReconnecting`); genuine resolution failures still go to `_unresolvedTags` and the retry timer, and now also push a `TagValueUpdate` with `QualityCode.Bad` to the subscribing Instance Actor, matching the design doc's Tag Path Resolution step 2. Regression tests `DCL004_GenuineTagResolutionFailure_PushesBadQualityToSubscriber` and `DCL004_ConnectionLevelSubscribeFailure_TriggersReconnect_NotTagRetry` both fail against the pre-fix code and pass after. Fixed by the commit whose message references `DataConnectionLayer-004` (commit ``). ### DataConnectionLayer-005 — `WriteTimeout` option is documented and configured but never applied | | | |--|--| | Severity | High | | Category | Design-document adherence | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/DataConnectionOptions.cs:15`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:573-590` | **Description** `DataConnectionOptions.WriteTimeout` (default 30 s) and the design doc's "Shared Settings" table both promise a bounded timeout for synchronous device writes. The value is never read anywhere in the module (`grep` confirms only the declaration). `HandleWrite` calls `_adapter.WriteAsync(request.TagPath, request.Value)` with no `CancellationToken` and no timeout. If the OPC UA server hangs (TCP black-hole, no RST), the write `Task` never completes, `PipeTo(sender)` never fires, and the calling script's Ask blocks until its own ask-timeout — and the script gets no DCL-level error. The design states write failures (including timeout) must be returned synchronously to the script; an unbounded write violates that. **Recommendation** Create a `CancellationTokenSource(_options.WriteTimeout)`, pass its token to `WriteAsync`, and in the continuation translate cancellation into a failed `WriteTagResponse` with a timeout error message. Apply the same to the read used by the initial-value seed and to `WriteBatchAndWaitAsync` paths if they are reachable. **Resolution** Resolved 2026-05-16. `HandleWrite` now creates a `CancellationTokenSource(_options.WriteTimeout)`, passes its token to `_adapter.WriteAsync(...)`, and disposes the source in the continuation. A cancelled/timed-out write (`Task.IsCanceled` or a base `OperationCanceledException`) is translated into a failed `WriteTagResponse` with a `"Write timeout after Ns"` message, so a hung device write is bounded and the failure is returned synchronously to the calling script (WP-11) instead of blocking until the script's own Ask-timeout. (The `WriteBatchAndWaitAsync` adapter path already accepts an explicit `timeout`/`CancellationToken` and is not invoked by `HandleWrite`, so no change was needed there.) Regression test `DCL005_Write_ThatHangs_TimesOutAndReturnsFailureSynchronously` uses an adapter whose `WriteAsync` only completes when its token fires; it fails against the pre-fix unbounded code and passes after. Fixed by the commit whose message references `DataConnectionLayer-005` (commit ``). ### DataConnectionLayer-006 — Health quality counters not reset/recomputed after failover or re-subscribe | | | |--|--| | Severity | Medium | | Category | Correctness & logic bugs | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:645-673,721-756` | **Description** `ReSubscribeAll` resets `_subscriptionIds`, `_unresolvedTags` and `_resolvedTags` to a clean slate, but leaves `_lastTagQuality`, `_tagsGoodQuality`, `_tagsBadQuality` and `_tagsUncertainQuality` untouched. `PushBadQualityForAllTags` (called on disconnect) sets `_tagsBadQuality = _lastTagQuality.Count` and zeroes the others. After a reconnect, `HandleTagValueReceived` decrements the *old* bucket using `_lastTagQuality`'s value and increments the new one — but tags resolved for the first time after reconnect were never in `_lastTagQuality`, so they only increment, never decrement, and the totals can drift above `_totalSubscribed`. Over repeated disconnect/reconnect cycles the health report's good/bad/uncertain counts become unreliable. **Verification note**: Confirmed against source. The root cause is broader than the reconnect path the finding describes: `HandleUnsubscribe` also never removes a tag from `_lastTagQuality` nor decrements its quality bucket, so an unsubscribed tag lingers and `PushBadQualityForAllTags` (which sets `_tagsBadQuality = _lastTagQuality.Count`) over-counts it — driving the bad-quality count above `_totalSubscribed` even without a re-subscribe. Both the unsubscribe leak and the re-subscribe drift are real. **Recommendation** On `BecomeConnected` after a re-subscribe (or in `ReSubscribeAll`), clear `_lastTagQuality` and the three quality counters and let them be repopulated from fresh `TagValueReceived` messages. Alternatively recompute the buckets from `_lastTagQuality` whenever it changes rather than maintaining incremental counters. **Resolution** Resolved 2026-05-16 (commit pending). `HandleUnsubscribe` now removes each unsubscribed tag from `_lastTagQuality` and decrements the corresponding quality bucket, then reports the corrected counters via `UpdateTagQuality`/`UpdateTagResolution`; `ReSubscribeAll` clears `_lastTagQuality` and zeroes the three quality counters so post-reconnect tags are repopulated from fresh `TagValueReceived` messages instead of only incrementing. Regression test `DCL006_DisconnectAfterUnsubscribe_BadQualityCountMatchesRemainingTags` subscribes two tags, pushes Good values, unsubscribes one, then disconnects and asserts `PushBadQualityForAllTags` reports exactly 1 bad tag (the reconnect is gated open so `ReSubscribeAll` does not run before the assertion); it reports 2 against the pre-fix code and 1 after. ### DataConnectionLayer-007 — `ReadBatchAsync` aborts the whole batch on the first failing tag | | | |--|--| | Severity | Medium | | Category | Correctness & logic bugs | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:187-195` | **Description** `ReadBatchAsync` loops calling `ReadAsync` per tag. `ReadAsync` re-throws any non-cancellation exception (line 184). So if any single tag in the batch throws (bad node, transient fault), the entire `ReadBatchAsync` throws and the caller gets no results for the tags that *did* read successfully — even though `ReadResult` already has a `Success`/`ErrorMessage` shape designed to carry per-tag failures. The batch is also fully serial (one round-trip per tag), defeating the point of a batch API; the design doc lists `ReadBatch`/`WriteBatch` as first-class operations. **Verification note**: Confirmed against source — `ReadAsync` re-throws on any non-`OperationCanceledException`, aborting the whole batch. **Recommendation** Catch per-tag exceptions inside the loop and store a failed `ReadResult` for that tag so the batch returns a complete map. Ideally issue a single OPC UA `Read` service call for all node IDs (`RealOpcUaClient.ReadValueAsync` already builds a `ReadValueIdCollection` — extend it to accept multiple nodes). **Resolution** Resolved 2026-05-16 (commit pending). `ReadBatchAsync` now wraps each per-tag `ReadAsync` in a try/catch: a per-tag exception is recorded as a failed `ReadResult` (`Success: false`, message = the exception message) so the batch returns a complete result map for every requested tag; `OperationCanceledException` is still propagated so a cancelled batch aborts as a whole. The per-tag-serial loop and single-service-call optimisation were deliberately left for a follow-up — they are a performance concern, not the correctness bug this finding raised. Regression test `DCL007_ReadBatch_ReturnsPerTagResults_WhenOneTagFails` reads three tags where the middle one throws and asserts all three appear in the result map with the failing one marked unsuccessful; it threw (no map returned) against the pre-fix code and passes after. ### DataConnectionLayer-008 — `HandleUnsubscribe` is O(n^2) over instances and rechecks `_unresolvedTags` redundantly | | | |--|--| | Severity | Low | | Category | Performance & resource management | | Status | Open | | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:540-569` | **Description** For each tag of the instance being removed, `HandleUnsubscribe` scans every other instance's tag set (`_subscriptionsByInstance.Where(...).Any()`), making the operation O(tags x instances). On a site with many instances sharing a connection this is needlessly expensive on every instance stop/redeploy. Separately, line 562 re-evaluates `!_unresolvedTags.Contains(tagPath)` immediately after line 561 already removed `tagPath` from `_unresolvedTags`, so the condition is always true — dead logic that obscures intent (the decrement of `_resolvedTags` is unconditional in practice). **Recommendation** Maintain a reference count per tag path (or a `tagPath -> set` reverse index) so the "any other subscriber" check is O(1). Remove the redundant `_unresolvedTags` re-check or restructure so the resolved/unresolved decrement reflects the tag's actual prior state captured before removal. **Resolution** _Unresolved._ ### DataConnectionLayer-009 — Implemented failover heuristic diverges from the documented state machine | | | |--|--| | Severity | Medium — partially design-doc work outside this module's editable scope | | Category | Design-document adherence | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:189,242-297,379-449`, `docs/requirements/Component-DataConnectionLayer.md:73-85` | **Description** The design doc's failover state machine reads "retry active endpoint (5s) -> N failures (>= FailoverRetryCount) -> switch to other endpoint". The code implements two *separate* failover triggers: (a) `HandleReconnectResult` counts `_consecutiveFailures` on connect-attempt failures (matches the doc), and (b) `BecomeReconnecting` additionally counts `_consecutiveUnstableDisconnects` — connections that succeeded but dropped within a hard-coded 60 s `StableConnectionThreshold` — and fails over on that count too. The unstable-disconnect path, the 60 s threshold, and the fact that failover can happen on *successful-but-flaky* connections are not described in the component doc at all. A reviewer or operator reading `Component-DataConnectionLayer.md` would not predict this behaviour, and the 60 s threshold is a magic constant not exposed via `DataConnectionOptions`. **Verification note**: Confirmed against source. The hard-coded `StableConnectionThreshold = TimeSpan.FromSeconds(60)` `static readonly` field and the `_consecutiveUnstableDisconnects` failover path both exist as described. **Recommendation** Update `Component-DataConnectionLayer.md` to document the unstable-disconnect failover path and the stability threshold, and move the 60 s threshold into `DataConnectionOptions` so it is configurable and consistent with the other tunables. **Resolution** Resolved 2026-05-16 (commit pending). The configurability half of the recommendation is done: the hard-coded `StableConnectionThreshold` constant was removed from `DataConnectionActor` and replaced with a new `DataConnectionOptions.StableConnectionThreshold` property (60 s default), bindable from the `DataConnectionLayer` `appsettings.json` section like `ReconnectInterval`/`TagResolutionRetryInterval`/`WriteTimeout`. Regression test `DCL009_StableConnectionThreshold_IsConfigurable_WithSixtySecondDefault` guards the default and the setter. **The documentation half is out of this module's editable scope** — `docs/requirements/Component-DataConnectionLayer.md` (lines 73-85) still describes only the connect-failure failover path and does not mention the unstable-disconnect trigger. **Action required (surfaced):** the DCL design doc should be updated to document the unstable-disconnect failover path and the configurable stability threshold; that edit was deliberately not made here because this task is scoped to `src/ScadaLink.DataConnectionLayer`, tests, and this findings file only. ### DataConnectionLayer-010 — Tag-resolution retry can issue duplicate concurrent subscribe attempts | | | |--|--| | Severity | Medium | | Category | Correctness & logic bugs | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:594-619,689-703` | **Description** `HandleRetryTagResolution` fires `SubscribeAsync` for every tag in `_unresolvedTags` via `ContinueWith(...).PipeTo(self)`, but does **not** remove the tags from `_unresolvedTags` while the attempts are in flight. Because tags are not removed before the retry, a slow `SubscribeAsync` overlapping the next 10 s tick issues duplicate concurrent subscribe attempts for the same tag, which can create duplicate monitored items / leaked subscription IDs (the second success overwrites `_subscriptionIds[tag]` in `HandleTagResolutionSucceeded`, orphaning the first handle with no `UnsubscribeAsync` call). The timer-cancel condition in `HandleTagResolutionSucceeded` is also non-deterministic for the same reason. **Verification note**: Confirmed against source — `HandleRetryTagResolution` dispatched `SubscribeAsync` for every tag in `_unresolvedTags` on every tick with no in-flight guard. **Recommendation** Remove tags from `_unresolvedTags` (into an "in-flight" set) when a retry is dispatched, and only put them back on failure. This prevents overlapping duplicate subscribe attempts and makes the timer-cancel condition deterministic. **Resolution** Resolved 2026-05-16 (commit pending). A new `_resolutionInFlight` `HashSet` tracks tags whose retry `SubscribeAsync` is currently outstanding. `HandleRetryTagResolution` now dispatches only for unresolved tags **not** already in flight (and skips entirely if all are in flight), adding each dispatched tag to the set; `HandleTagResolutionSucceeded` and `HandleTagResolutionFailed` remove the tag from the set when its attempt completes, and `HandleUnsubscribe`/`ReSubscribeAll` clear stale entries. This prevents overlapping duplicate subscribe attempts and the resulting orphaned monitored items. Regression test `DCL010_TagResolutionRetry_DoesNotIssueDuplicateConcurrentSubscribes` gives a tag a genuine initial failure then a retry `SubscribeAsync` that never completes, lets six 100 ms retry ticks elapse, and asserts exactly one retry was dispatched (2 total subscribe calls); the pre-fix code dispatched on every tick (6 total). ### DataConnectionLayer-011 — Stale subscription callbacks from disposed adapters can still reach the actor | | | |--|--| | Severity | Medium | | Category | Error handling & resilience | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:486-489,278-285,416-425`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:252-262` | **Description** On failover the actor disposes the old adapter (`_adapter.DisposeAsync()`, fire-and-forget) and creates a fresh one. The old adapter's subscription callbacks captured `self` and `tagPath` and `Tell` `TagValueReceived` to the actor. While the `Reconnecting` handler ignores `TagValueReceived` (line 334), once the actor reaches `Connected` again it processes them — and a disposed adapter whose OPC UA SDK threads have not yet fully torn down could still deliver a value, mixing pre-failover device data with the new endpoint's data and briefly reporting a value the active endpoint never produced. There is no per-adapter generation/epoch tag on `TagValueReceived` to distinguish current from stale callbacks. **Verification note**: Confirmed against source — `TagValueReceived` carried no adapter identity, and `HandleTagValueReceived` (reachable in `Connected`) processed any such message regardless of which adapter produced it. **Recommendation** Add an adapter-generation counter incremented on every adapter swap; stamp it onto `TagValueReceived` (captured in the callback closure) and drop messages whose generation does not match the current adapter in `HandleTagValueReceived`. **Resolution** Resolved 2026-05-16 (commit pending). Implemented exactly as recommended: a new `_adapterGeneration` `int` field is incremented at both adapter-swap sites (the unstable-disconnect failover in `BecomeReconnecting` and the connect-failure failover in `HandleReconnectResult`). The `TagValueReceived` record gained an `AdapterGeneration` field; every subscription callback closure (`HandleSubscribe`, the initial-read seed, `HandleRetryTagResolution`, `ReSubscribeAll`) captures the generation in effect at subscribe time and stamps it onto each `TagValueReceived`. `HandleTagValueReceived` drops any message whose generation no longer matches the current adapter, so a callback fired by a disposed adapter after failover cannot reach an Instance Actor. Regression test `DCL011_StaleTagValueFromOldAdapter_IsNotForwardedAfterFailover` subscribes on the primary, fails over to the backup, then invokes the captured primary callback with a stale value and asserts the subscriber receives nothing; the stale value reached the subscriber against the pre-fix code and is dropped after. ### DataConnectionLayer-012 — `AutoAcceptUntrustedCerts` defaults to `true`, accepting any server certificate | | | |--|--| | Severity | Medium — full secure default also requires a Commons + design-doc change outside this module | | Category | Security | | Status | Resolved | | Location | `src/ScadaLink.DataConnectionLayer/Adapters/IOpcUaClient.cs:17`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:49,60-61`, `docs/requirements/Component-DataConnectionLayer.md:116` | **Description** `OpcUaConnectionOptions.AutoAcceptUntrustedCerts` defaults to `true`, and `RealOpcUaClient.ConnectAsync` wires `CertificateValidator.CertificateValidation += (_, e) => e.Accept = true` when it is set. With the default, every server certificate is accepted unconditionally — there is no certificate-pinning or trust-store enforcement — which defeats the `Sign`/`SignAndEncrypt` security modes against an active man-in-the-middle on the OPC UA link. The design doc explicitly lists `true` as the default. For an industrial control link this is a meaningful exposure; a secure-by-default posture would reject untrusted certs unless an operator opts in per connection. **Verification note**: Confirmed against source. Note the *authoritative* runtime default does not actually live on `OpcUaConnectionOptions` — for a real connection `OpcUaDataConnection.ConnectAsync` builds `OpcUaConnectionOptions` from `OpcUaEndpointConfig` (in `ScadaLink.Commons`), whose `AutoAcceptUntrustedCerts` property also defaults to `true`. `OpcUaConnectionOptions`' own default is only the fallback used when an `OpcUaConnectionOptions` is constructed directly. **Recommendation** Default `AutoAcceptUntrustedCerts` to `false` and require explicit per-connection opt-in, or at minimum log a prominent warning whenever the auto-accept validator is installed. Update the design doc to reflect the secure default. **Resolution** Resolved 2026-05-16 (commit pending). The two in-scope parts of the recommendation are done: (1) `OpcUaConnectionOptions.AutoAcceptUntrustedCerts` now defaults to `false`; (2) `RealOpcUaClient.ConnectAsync` logs a prominent `ILogger` warning whenever the auto-accept certificate validator is installed (an `ILogger` was added as an optional constructor parameter, defaulting to `NullLogger`, so existing callers are unaffected). Regression test `DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse` guards the new secure default. **Two parts remain outside this module's editable scope and are surfaced as action required:** (a) `ScadaLink.Commons.Types.DataConnections.OpcUaEndpointConfig.AutoAcceptUntrustedCerts` still defaults to `true` — since that is the value actually used for a real connection (see verification note above), the Commons default must also be flipped to `false` for the system to be secure-by-default; (b) `docs/requirements/Component-DataConnectionLayer.md` line 116 still documents `true` as the default and must be updated. Both edits were deliberately not made here because this task is scoped to `src/ScadaLink.DataConnectionLayer`, tests, and this findings file only. ### DataConnectionLayer-013 — Misleading XML comment: `RaiseDisconnected` claims thread safety it does not provide | | | |--|--| | Severity | Low | | Category | Documentation & comments | | Status | Open | | Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:270-281` | **Description** The XML doc on `RaiseDisconnected` states "Thread-safe: only the first caller triggers the event." The implementation is a non-atomic check-then-set on a `volatile bool` (`if (_disconnectFired) return; _disconnectFired = true;`). `volatile` guarantees visibility, not atomicity — two threads (e.g. the OPC UA keep-alive thread via `OnClientConnectionLost` and a `ReadAsync` failure path) can both observe `_disconnectFired == false` and both invoke `Disconnected`. In practice the `DataConnectionActor` tolerates a duplicate `AdapterDisconnected` message, so impact is low, but the comment overstates the guarantee. The same pattern exists in `RealOpcUaClient.OnSessionKeepAlive` (`_connectionLostFired`). **Recommendation** Either make the guard atomic (`Interlocked.Exchange` with an `int` flag, or a lock), or correct the comment to say "best-effort once-only; a duplicate event is possible under a race and is tolerated downstream." **Resolution** _Unresolved._