Files
scadalink-design/code-reviews/DataConnectionLayer/findings.md

633 lines
34 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Code Review — DataConnectionLayer
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.DataConnectionLayer` |
| Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 2 |
## Summary
The DataConnectionLayer is a reasonably well-structured module: the Become/Stash
lifecycle state machine, the captured-`Self` marshalling of background-thread
disconnect events, and the protocol-factory abstraction all follow the design doc
and Akka.NET conventions. However, the review found one **critical** actor-model
violation — `HandleSubscribe` spawns a `Task.Run` that mutates the actor's private
dictionaries and counters from a thread-pool thread, racing with the actor's own
message loop. Several **high**-severity issues cluster around concurrency and error
handling: the subscription-failure path leaves the connection with degraded subtrees
but no real recovery, the `DataConnectionManagerActor`'s `Restart` supervision drops
all subscription state on a connection-actor crash, and `RealOpcUaClient`'s monitored-
item callback dictionary is mutated without synchronization while OPC UA notification
threads read it. The remaining findings concern stale health counters after failover,
an unused `WriteTimeout` option (writes are unbounded despite the design promising a
30 s timeout), `ReadBatchAsync` aborting mid-batch, and documentation drift between
the design doc's failover state machine and the implemented unstable-disconnect
heuristic. Test coverage is adequate for the happy paths and failover but absent for
tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | `_resolvedTags` double-counting and stale counters after failover; `ReadBatchAsync` aborts mid-batch. |
| 2 | Akka.NET conventions | x | `Task.Run` mutating actor state (critical); `Restart` supervision loses state; closures capturing `_subscriptionsByInstance`. |
| 3 | Concurrency & thread safety | x | Actor state mutated off the actor thread; `RealOpcUaClient` callback dictionary unsynchronized. |
| 4 | Error handling & resilience | x | Subscription failures not surfaced; unbounded write with no timeout; reconnect after subscribe-time failure not handled. |
| 5 | Security | x | `AutoAcceptUntrustedCerts` defaults to `true`; OPC UA password handling acceptable. See finding 012. |
| 6 | Performance & resource management | x | `HandleUnsubscribe` O(n^2) over instances; initial-read loop serial per tag. |
| 7 | Design-document adherence | x | Failover heuristic (unstable-disconnect count) differs from documented state machine; `WriteTimeout` documented but unused. |
| 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. |
| 9 | Testing coverage | x | No tests for tag-resolution retry, disconnect/re-subscribe, bad-quality push, or `HandleSubscribe` concurrency. |
| 10 | Documentation & comments | x | XML comment on `RaiseDisconnected` claims thread safety it does not have; design doc round-robin description stale. |
## Findings
### DataConnectionLayer-001 — `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread
| | |
|--|--|
| Severity | Critical |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:473-538` |
**Description**
`HandleSubscribe` launches a `Task.Run(async () => ...)` that runs on a thread-pool
thread and directly mutates the actor's private mutable state: `instanceTags` (a
reference into `_subscriptionsByInstance`), `_subscriptionIds`, `_totalSubscribed`,
`_resolvedTags`, and `_unresolvedTags`. All of these are simultaneously read and
written by the actor's own message loop (`HandleTagValueReceived`, `HandleUnsubscribe`,
`ReSubscribeAll`, `HandleRetryTagResolution`, `ReplyWithHealthReport`). This is a
direct violation of the Akka.NET actor model, which guarantees single-threaded access
to actor state only when state is touched on the actor thread. Two concurrent
subscribe requests, or a subscribe overlapping a `TagValueReceived` / `GetHealthReport`,
produce data races on `Dictionary`/`HashSet`/`int``Dictionary` is not thread-safe
and concurrent mutation can corrupt internal buckets, throw, or lose entries. It can
also produce torn reads of the health counters.
**Recommendation**
Do not mutate actor state from the background task. Perform only the `await
_adapter.SubscribeAsync(...)` / `ReadAsync(...)` I/O in the task, collect the results
into a local immutable result object, and `PipeTo(Self)` an internal message (e.g.
`SubscribeCompleted`) whose handler — running on the actor thread — applies all state
mutations and counter updates. The response to `Sender` should be sent from that
handler too.
**Resolution**
Resolved 2026-05-16. `HandleSubscribe` was restructured to follow the actor's own
`PipeTo(Self)` pattern (the one already used by `HandleRetryTagResolution`): the
background `Task.Run` now performs only adapter I/O (`SubscribeAsync`/`ReadAsync`),
collects per-tag outcomes into an immutable `SubscribeCompleted` message, and pipes
that to `Self`. All mutation of `_subscriptionIds`, `_subscriptionsByInstance`,
`_totalSubscribed`, `_resolvedTags` and `_unresolvedTags` now happens in the new
`HandleSubscribeCompleted` handler on the actor thread; it is wired into the
Connected, Connecting and Reconnecting states so an in-flight subscribe is applied
regardless of state transitions. Regression test
`DCL001_ConcurrentSubscribes_DoNotCorruptSubscriptionCounters` (30×30 concurrent
subscribes) fails against the pre-fix code and passes after. Fixed by the commit
whose message references `DataConnectionLayer-001`.
### DataConnectionLayer-002 — `Restart` supervision discards all subscription state on connection-actor crash
| | |
|--|--|
| Severity | High |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionManagerActor.cs:131-141` |
**Description**
`DataConnectionManagerActor.SupervisorStrategy` returns a `OneForOneStrategy` with
`Directive.Restart` for `DataConnectionActor` failures. On restart, Akka.NET creates a
fresh actor instance, so all in-memory fields — `_subscriptionsByInstance`,
`_subscriptionIds`, `_subscribers`, `_unresolvedTags`, the quality counters — are
silently discarded. The actor re-enters `Connecting` with zero subscriptions, and the
design doc's "transparent re-subscribe" guarantee (WP-10) is broken: Instance Actors
that had subscribed before the crash never get their tags re-subscribed and will sit
at uncertain/stale quality indefinitely with no error returned. There is no durable
subscription store from which a restarted actor could rebuild state.
**Recommendation**
Either (a) make the subscription registry durable/recoverable so a restarted actor
can rebuild it (persist to local SQLite as the design doc says connection definitions
are, and have `PreStart` reload subscriptions), or (b) treat a connection-actor crash
as a lifecycle event the `DataConnectionManagerActor` notices, so it can re-issue the
subscription registrations. At minimum document that subscribers must re-register
after a crash and surface the lost-state condition rather than failing silently.
**Resolution**
Resolved 2026-05-16. The `DataConnectionManagerActor.SupervisorStrategy` was changed
from `Directive.Restart` to `Directive.Resume` for `DataConnectionActor` failures.
`Resume` keeps the existing actor instance and all its in-memory subscription state
(`_subscriptionsByInstance`, `_subscriptionIds`, `_subscribers`, quality counters)
intact across a transient handler exception, so the design doc's "transparent
re-subscribe" guarantee (WP-10) is preserved. The actor is a long-lived stateful
coordinator and its own Become/Stash reconnect state machine already recovers
connection-level faults — it does not need a restart. This also aligns with the
ScadaLink convention of `Resume` for coordinator actors. Regression test
`DCL002_ConnectionActorCrash_PreservesSubscriptionState` crashes the connection actor
via a synchronously-throwing write and asserts the subscription survives (health
report still shows 1 subscribed/resolved tag); it fails against the pre-fix `Restart`
code and passes after. Fixed by the commit whose message references
`DataConnectionLayer-002` (commit `<pending>`).
### DataConnectionLayer-003 — `RealOpcUaClient` callback/monitored-item dictionaries mutated without synchronization
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:16-17,130-131,153,163,173,183-184` |
**Description**
`_monitoredItems` and `_callbacks` are plain `Dictionary<,>` instances. They are
written from `CreateSubscriptionAsync` / `RemoveSubscriptionAsync` (invoked from the
`DataConnectionActor`'s `Task.Run` / `ContinueWith` continuations, i.e. thread-pool
threads) and from `DisconnectAsync` (`.Clear()`), while being read concurrently from
the OPC Foundation SDK's `MonitoredItem.Notification` event handler, which fires on
the SDK's internal publish threads (`_callbacks.TryGetValue(handle, ...)` at line
163). Concurrent reads during a `Dictionary` resize or `Clear()` are undefined
behaviour — they can throw `InvalidOperationException`, return wrong entries, or
corrupt the dictionary. The `DataConnectionActor`'s subscribe path already runs off
the actor thread (finding 001), so multiple subscribe calls can also race each other
here.
**Recommendation**
Use `ConcurrentDictionary<,>` for `_monitoredItems` and `_callbacks`, or guard all
access with a lock. Note that fixing finding 001 (serialising subscribe through the
actor thread) reduces but does not eliminate the race, because the SDK notification
threads still read `_callbacks` concurrently with `RemoveSubscriptionAsync` /
`DisconnectAsync`.
**Resolution**
Resolved 2026-05-16. `_monitoredItems` and `_callbacks` in `RealOpcUaClient` were
changed from plain `Dictionary<,>` to `ConcurrentDictionary<,>`, and the two
`Remove(key)` call sites switched to `TryRemove`. This makes the maps safe to read
from the OPC Foundation SDK's publish threads (`MonitoredItem.Notification` reading
`_callbacks`) concurrently with subscribe/disconnect mutations on other threads.
`RealOpcUaClient` wraps concrete OPC Foundation SDK types (`ISession`,
`Subscription`, `MonitoredItem`) and cannot be exercised without a live OPC UA
server, so the regression is guarded structurally by
`DCL003_SharedDictionaryFields_AreConcurrentCollections` (a reflection test asserting
both fields are `ConcurrentDictionary<,>`); it fails against the pre-fix `Dictionary`
code and passes after. Fixed by the commit whose message references
`DataConnectionLayer-003` (commit `<pending>`).
### DataConnectionLayer-004 — Subscribe-time tag-resolution failure leaves the connection healthy but never recovers correctly
| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:495-503,529-537` |
**Description**
When `_adapter.SubscribeAsync` throws inside the `HandleSubscribe` background task,
the catch block adds the tag to `_unresolvedTags` and increments `_totalSubscribed`,
treating every subscribe exception as a tag-resolution failure. But `SubscribeAsync`
also throws `InvalidOperationException` from `EnsureConnected()` when the OPC UA
client is not connected, and throws on transport faults — these are connection
problems, not bad tag paths. They get misclassified as unresolved tags and retried on
the 10 s tag-resolution timer instead of triggering the reconnection state machine.
Worse, the design doc (Tag Path Resolution, step 2) says the failed tag's attribute
must be marked quality `bad`; the code never pushes a bad-quality update to the
subscriber for a tag that fails to resolve at subscribe time, so the Instance Actor
stays at uncertain quality with no signal. The `TagResolutionFailed` message it sends
to `Self` only logs and re-arms the timer (`HandleTagResolutionFailed`).
**Recommendation**
Distinguish connection-level exceptions (raise `AdapterDisconnected` / let the
reconnect machine handle them) from genuine node-not-found errors. For genuine
resolution failures, push a `TagValueUpdate` with `QualityCode.Bad` to the subscribing
Instance Actor so it reflects the documented behaviour.
**Resolution**
Resolved 2026-05-16. The subscribe background task now classifies each subscribe
exception via the new `IsConnectionLevelFailure` helper (`InvalidOperationException`
— thrown by `EnsureConnected()` — plus `SocketException`/`TimeoutException`/
`IOException` count as connection-level; anything else is a genuine resolution
failure). The classification is carried on `SubscribeTagResult.ConnectionLevelFailure`
and applied on the actor thread in `HandleSubscribeCompleted`: connection-level
failures no longer become unresolved tags and instead drive the reconnection state
machine (`HandleSubscribeCompleted` returns a flag and the Connected-state handler
calls `BecomeReconnecting`); genuine resolution failures still go to `_unresolvedTags`
and the retry timer, and now also push a `TagValueUpdate` with `QualityCode.Bad` to
the subscribing Instance Actor, matching the design doc's Tag Path Resolution step 2.
Regression tests `DCL004_GenuineTagResolutionFailure_PushesBadQualityToSubscriber`
and `DCL004_ConnectionLevelSubscribeFailure_TriggersReconnect_NotTagRetry` both fail
against the pre-fix code and pass after. Fixed by the commit whose message references
`DataConnectionLayer-004` (commit `<pending>`).
### DataConnectionLayer-005 — `WriteTimeout` option is documented and configured but never applied
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/DataConnectionOptions.cs:15`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:573-590` |
**Description**
`DataConnectionOptions.WriteTimeout` (default 30 s) and the design doc's "Shared
Settings" table both promise a bounded timeout for synchronous device writes. The
value is never read anywhere in the module (`grep` confirms only the declaration).
`HandleWrite` calls `_adapter.WriteAsync(request.TagPath, request.Value)` with no
`CancellationToken` and no timeout. If the OPC UA server hangs (TCP black-hole, no
RST), the write `Task` never completes, `PipeTo(sender)` never fires, and the calling
script's Ask blocks until its own ask-timeout — and the script gets no DCL-level
error. The design states write failures (including timeout) must be returned
synchronously to the script; an unbounded write violates that.
**Recommendation**
Create a `CancellationTokenSource(_options.WriteTimeout)`, pass its token to
`WriteAsync`, and in the continuation translate cancellation into a failed
`WriteTagResponse` with a timeout error message. Apply the same to the read used by
the initial-value seed and to `WriteBatchAndWaitAsync` paths if they are reachable.
**Resolution**
Resolved 2026-05-16. `HandleWrite` now creates a `CancellationTokenSource(_options.WriteTimeout)`,
passes its token to `_adapter.WriteAsync(...)`, and disposes the source in the
continuation. A cancelled/timed-out write (`Task.IsCanceled` or a base
`OperationCanceledException`) is translated into a failed `WriteTagResponse` with a
`"Write timeout after Ns"` message, so a hung device write is bounded and the failure
is returned synchronously to the calling script (WP-11) instead of blocking until the
script's own Ask-timeout. (The `WriteBatchAndWaitAsync` adapter path already accepts
an explicit `timeout`/`CancellationToken` and is not invoked by `HandleWrite`, so no
change was needed there.) Regression test
`DCL005_Write_ThatHangs_TimesOutAndReturnsFailureSynchronously` uses an adapter whose
`WriteAsync` only completes when its token fires; it fails against the pre-fix
unbounded code and passes after. Fixed by the commit whose message references
`DataConnectionLayer-005` (commit `<pending>`).
### DataConnectionLayer-006 — Health quality counters not reset/recomputed after failover or re-subscribe
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:645-673,721-756` |
**Description**
`ReSubscribeAll` resets `_subscriptionIds`, `_unresolvedTags` and `_resolvedTags` to a
clean slate, but leaves `_lastTagQuality`, `_tagsGoodQuality`, `_tagsBadQuality` and
`_tagsUncertainQuality` untouched. `PushBadQualityForAllTags` (called on disconnect)
sets `_tagsBadQuality = _lastTagQuality.Count` and zeroes the others. After a
reconnect, `HandleTagValueReceived` decrements the *old* bucket using
`_lastTagQuality`'s value and increments the new one — but tags resolved for the first
time after reconnect were never in `_lastTagQuality`, so they only increment, never
decrement, and the totals can drift above `_totalSubscribed`. Over repeated
disconnect/reconnect cycles the health report's good/bad/uncertain counts become
unreliable.
**Verification note**: Confirmed against source. The root cause is broader than the
reconnect path the finding describes: `HandleUnsubscribe` also never removes a tag
from `_lastTagQuality` nor decrements its quality bucket, so an unsubscribed tag
lingers and `PushBadQualityForAllTags` (which sets `_tagsBadQuality =
_lastTagQuality.Count`) over-counts it — driving the bad-quality count above
`_totalSubscribed` even without a re-subscribe. Both the unsubscribe leak and the
re-subscribe drift are real.
**Recommendation**
On `BecomeConnected` after a re-subscribe (or in `ReSubscribeAll`), clear
`_lastTagQuality` and the three quality counters and let them be repopulated from
fresh `TagValueReceived` messages. Alternatively recompute the buckets from
`_lastTagQuality` whenever it changes rather than maintaining incremental counters.
**Resolution**
Resolved 2026-05-16 (commit pending). `HandleUnsubscribe` now removes each
unsubscribed tag from `_lastTagQuality` and decrements the corresponding quality
bucket, then reports the corrected counters via `UpdateTagQuality`/`UpdateTagResolution`;
`ReSubscribeAll` clears `_lastTagQuality` and zeroes the three quality counters so
post-reconnect tags are repopulated from fresh `TagValueReceived` messages instead of
only incrementing. Regression test
`DCL006_DisconnectAfterUnsubscribe_BadQualityCountMatchesRemainingTags` subscribes two
tags, pushes Good values, unsubscribes one, then disconnects and asserts
`PushBadQualityForAllTags` reports exactly 1 bad tag (the reconnect is gated open so
`ReSubscribeAll` does not run before the assertion); it reports 2 against the pre-fix
code and 1 after.
### DataConnectionLayer-007 — `ReadBatchAsync` aborts the whole batch on the first failing tag
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:187-195` |
**Description**
`ReadBatchAsync` loops calling `ReadAsync` per tag. `ReadAsync` re-throws any
non-cancellation exception (line 184). So if any single tag in the batch throws (bad
node, transient fault), the entire `ReadBatchAsync` throws and the caller gets no
results for the tags that *did* read successfully — even though `ReadResult` already
has a `Success`/`ErrorMessage` shape designed to carry per-tag failures. The batch is
also fully serial (one round-trip per tag), defeating the point of a batch API; the
design doc lists `ReadBatch`/`WriteBatch` as first-class operations.
**Verification note**: Confirmed against source — `ReadAsync` re-throws on any
non-`OperationCanceledException`, aborting the whole batch.
**Recommendation**
Catch per-tag exceptions inside the loop and store a failed `ReadResult` for that tag
so the batch returns a complete map. Ideally issue a single OPC UA `Read` service call
for all node IDs (`RealOpcUaClient.ReadValueAsync` already builds a
`ReadValueIdCollection` — extend it to accept multiple nodes).
**Resolution**
Resolved 2026-05-16 (commit pending). `ReadBatchAsync` now wraps each per-tag
`ReadAsync` in a try/catch: a per-tag exception is recorded as a failed `ReadResult`
(`Success: false`, message = the exception message) so the batch returns a complete
result map for every requested tag; `OperationCanceledException` is still propagated
so a cancelled batch aborts as a whole. The per-tag-serial loop and single-service-call
optimisation were deliberately left for a follow-up — they are a performance concern,
not the correctness bug this finding raised. Regression test
`DCL007_ReadBatch_ReturnsPerTagResults_WhenOneTagFails` reads three tags where the
middle one throws and asserts all three appear in the result map with the failing one
marked unsuccessful; it threw (no map returned) against the pre-fix code and passes
after.
### DataConnectionLayer-008 — `HandleUnsubscribe` is O(n^2) over instances and rechecks `_unresolvedTags` redundantly
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:540-569` |
**Description**
For each tag of the instance being removed, `HandleUnsubscribe` scans every other
instance's tag set (`_subscriptionsByInstance.Where(...).Any()`), making the operation
O(tags x instances). On a site with many instances sharing a connection this is
needlessly expensive on every instance stop/redeploy. Separately, line 562
re-evaluates `!_unresolvedTags.Contains(tagPath)` immediately after line 561 already
removed `tagPath` from `_unresolvedTags`, so the condition is always true — dead
logic that obscures intent (the decrement of `_resolvedTags` is unconditional in
practice).
**Recommendation**
Maintain a reference count per tag path (or a `tagPath -> set<instance>` reverse index)
so the "any other subscriber" check is O(1). Remove the redundant `_unresolvedTags`
re-check or restructure so the resolved/unresolved decrement reflects the tag's actual
prior state captured before removal.
**Resolution**
_Unresolved._
### DataConnectionLayer-009 — Implemented failover heuristic diverges from the documented state machine
| | |
|--|--|
| Severity | Medium — partially design-doc work outside this module's editable scope |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:189,242-297,379-449`, `docs/requirements/Component-DataConnectionLayer.md:73-85` |
**Description**
The design doc's failover state machine reads "retry active endpoint (5s) -> N failures
(>= FailoverRetryCount) -> switch to other endpoint". The code implements two *separate*
failover triggers: (a) `HandleReconnectResult` counts `_consecutiveFailures` on
connect-attempt failures (matches the doc), and (b) `BecomeReconnecting` additionally
counts `_consecutiveUnstableDisconnects` — connections that succeeded but dropped
within a hard-coded 60 s `StableConnectionThreshold` — and fails over on that count
too. The unstable-disconnect path, the 60 s threshold, and the fact that failover can
happen on *successful-but-flaky* connections are not described in the component doc at
all. A reviewer or operator reading `Component-DataConnectionLayer.md` would not
predict this behaviour, and the 60 s threshold is a magic constant not exposed via
`DataConnectionOptions`.
**Verification note**: Confirmed against source. The hard-coded
`StableConnectionThreshold = TimeSpan.FromSeconds(60)` `static readonly` field and the
`_consecutiveUnstableDisconnects` failover path both exist as described.
**Recommendation**
Update `Component-DataConnectionLayer.md` to document the unstable-disconnect failover
path and the stability threshold, and move the 60 s threshold into
`DataConnectionOptions` so it is configurable and consistent with the other tunables.
**Resolution**
Resolved 2026-05-16 (commit pending). The configurability half of the recommendation
is done: the hard-coded `StableConnectionThreshold` constant was removed from
`DataConnectionActor` and replaced with a new `DataConnectionOptions.StableConnectionThreshold`
property (60 s default), bindable from the `DataConnectionLayer` `appsettings.json`
section like `ReconnectInterval`/`TagResolutionRetryInterval`/`WriteTimeout`. Regression
test `DCL009_StableConnectionThreshold_IsConfigurable_WithSixtySecondDefault` guards
the default and the setter. **The documentation half is out of this module's editable
scope** — `docs/requirements/Component-DataConnectionLayer.md` (lines 73-85) still
describes only the connect-failure failover path and does not mention the
unstable-disconnect trigger. **Action required (surfaced):** the DCL design doc should
be updated to document the unstable-disconnect failover path and the configurable
stability threshold; that edit was deliberately not made here because this task is
scoped to `src/ScadaLink.DataConnectionLayer`, tests, and this findings file only.
### DataConnectionLayer-010 — Tag-resolution retry can issue duplicate concurrent subscribe attempts
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:594-619,689-703` |
**Description**
`HandleRetryTagResolution` fires `SubscribeAsync` for every tag in `_unresolvedTags`
via `ContinueWith(...).PipeTo(self)`, but does **not** remove the tags from
`_unresolvedTags` while the attempts are in flight. Because tags are not removed
before the retry, a slow `SubscribeAsync` overlapping the next 10 s tick issues
duplicate concurrent subscribe attempts for the same tag, which can create duplicate
monitored items / leaked subscription IDs (the second success overwrites
`_subscriptionIds[tag]` in `HandleTagResolutionSucceeded`, orphaning the first handle
with no `UnsubscribeAsync` call). The timer-cancel condition in
`HandleTagResolutionSucceeded` is also non-deterministic for the same reason.
**Verification note**: Confirmed against source — `HandleRetryTagResolution` dispatched
`SubscribeAsync` for every tag in `_unresolvedTags` on every tick with no in-flight
guard.
**Recommendation**
Remove tags from `_unresolvedTags` (into an "in-flight" set) when a retry is
dispatched, and only put them back on failure. This prevents overlapping duplicate
subscribe attempts and makes the timer-cancel condition deterministic.
**Resolution**
Resolved 2026-05-16 (commit pending). A new `_resolutionInFlight` `HashSet<string>`
tracks tags whose retry `SubscribeAsync` is currently outstanding.
`HandleRetryTagResolution` now dispatches only for unresolved tags **not** already in
flight (and skips entirely if all are in flight), adding each dispatched tag to the
set; `HandleTagResolutionSucceeded` and `HandleTagResolutionFailed` remove the tag
from the set when its attempt completes, and `HandleUnsubscribe`/`ReSubscribeAll`
clear stale entries. This prevents overlapping duplicate subscribe attempts and the
resulting orphaned monitored items. Regression test
`DCL010_TagResolutionRetry_DoesNotIssueDuplicateConcurrentSubscribes` gives a tag a
genuine initial failure then a retry `SubscribeAsync` that never completes, lets six
100 ms retry ticks elapse, and asserts exactly one retry was dispatched (2 total
subscribe calls); the pre-fix code dispatched on every tick (6 total).
### DataConnectionLayer-011 — Stale subscription callbacks from disposed adapters can still reach the actor
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:486-489,278-285,416-425`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:252-262` |
**Description**
On failover the actor disposes the old adapter (`_adapter.DisposeAsync()`,
fire-and-forget) and creates a fresh one. The old adapter's subscription callbacks
captured `self` and `tagPath` and `Tell` `TagValueReceived` to the actor. While the
`Reconnecting` handler ignores `TagValueReceived` (line 334), once the actor reaches
`Connected` again it processes them — and a disposed adapter whose OPC UA SDK threads
have not yet fully torn down could still deliver a value, mixing pre-failover device
data with the new endpoint's data and briefly reporting a value the active endpoint
never produced. There is no per-adapter generation/epoch tag on `TagValueReceived` to
distinguish current from stale callbacks.
**Verification note**: Confirmed against source — `TagValueReceived` carried no
adapter identity, and `HandleTagValueReceived` (reachable in `Connected`) processed
any such message regardless of which adapter produced it.
**Recommendation**
Add an adapter-generation counter incremented on every adapter swap; stamp it onto
`TagValueReceived` (captured in the callback closure) and drop messages whose
generation does not match the current adapter in `HandleTagValueReceived`.
**Resolution**
Resolved 2026-05-16 (commit pending). Implemented exactly as recommended: a new
`_adapterGeneration` `int` field is incremented at both adapter-swap sites (the
unstable-disconnect failover in `BecomeReconnecting` and the connect-failure failover
in `HandleReconnectResult`). The `TagValueReceived` record gained an
`AdapterGeneration` field; every subscription callback closure (`HandleSubscribe`, the
initial-read seed, `HandleRetryTagResolution`, `ReSubscribeAll`) captures the
generation in effect at subscribe time and stamps it onto each `TagValueReceived`.
`HandleTagValueReceived` drops any message whose generation no longer matches the
current adapter, so a callback fired by a disposed adapter after failover cannot reach
an Instance Actor. Regression test
`DCL011_StaleTagValueFromOldAdapter_IsNotForwardedAfterFailover` subscribes on the
primary, fails over to the backup, then invokes the captured primary callback with a
stale value and asserts the subscriber receives nothing; the stale value reached the
subscriber against the pre-fix code and is dropped after.
### DataConnectionLayer-012 — `AutoAcceptUntrustedCerts` defaults to `true`, accepting any server certificate
| | |
|--|--|
| Severity | Medium — full secure default also requires a Commons + design-doc change outside this module |
| Category | Security |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/IOpcUaClient.cs:17`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:49,60-61`, `docs/requirements/Component-DataConnectionLayer.md:116` |
**Description**
`OpcUaConnectionOptions.AutoAcceptUntrustedCerts` defaults to `true`, and
`RealOpcUaClient.ConnectAsync` wires `CertificateValidator.CertificateValidation += (_, e) => e.Accept = true`
when it is set. With the default, every server certificate is accepted unconditionally
— there is no certificate-pinning or trust-store enforcement — which defeats the
`Sign`/`SignAndEncrypt` security modes against an active man-in-the-middle on the OPC
UA link. The design doc explicitly lists `true` as the default. For an industrial
control link this is a meaningful exposure; a secure-by-default posture would reject
untrusted certs unless an operator opts in per connection.
**Verification note**: Confirmed against source. Note the *authoritative* runtime
default does not actually live on `OpcUaConnectionOptions` — for a real connection
`OpcUaDataConnection.ConnectAsync` builds `OpcUaConnectionOptions` from
`OpcUaEndpointConfig` (in `ScadaLink.Commons`), whose `AutoAcceptUntrustedCerts`
property also defaults to `true`. `OpcUaConnectionOptions`' own default is only the
fallback used when an `OpcUaConnectionOptions` is constructed directly.
**Recommendation**
Default `AutoAcceptUntrustedCerts` to `false` and require explicit per-connection
opt-in, or at minimum log a prominent warning whenever the auto-accept validator is
installed. Update the design doc to reflect the secure default.
**Resolution**
Resolved 2026-05-16 (commit pending). The two in-scope parts of the recommendation
are done: (1) `OpcUaConnectionOptions.AutoAcceptUntrustedCerts` now defaults to
`false`; (2) `RealOpcUaClient.ConnectAsync` logs a prominent `ILogger` warning
whenever the auto-accept certificate validator is installed (an `ILogger<RealOpcUaClient>`
was added as an optional constructor parameter, defaulting to `NullLogger`, so
existing callers are unaffected). Regression test
`DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse` guards the
new secure default. **Two parts remain outside this module's editable scope and are
surfaced as action required:** (a) `ScadaLink.Commons.Types.DataConnections.OpcUaEndpointConfig.AutoAcceptUntrustedCerts`
still defaults to `true` — since that is the value actually used for a real connection
(see verification note above), the Commons default must also be flipped to `false`
for the system to be secure-by-default; (b) `docs/requirements/Component-DataConnectionLayer.md`
line 116 still documents `true` as the default and must be updated. Both edits were
deliberately not made here because this task is scoped to
`src/ScadaLink.DataConnectionLayer`, tests, and this findings file only.
### DataConnectionLayer-013 — Misleading XML comment: `RaiseDisconnected` claims thread safety it does not provide
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:270-281` |
**Description**
The XML doc on `RaiseDisconnected` states "Thread-safe: only the first caller triggers
the event." The implementation is a non-atomic check-then-set on a `volatile bool`
(`if (_disconnectFired) return; _disconnectFired = true;`). `volatile` guarantees
visibility, not atomicity — two threads (e.g. the OPC UA keep-alive thread via
`OnClientConnectionLost` and a `ReadAsync` failure path) can both observe
`_disconnectFired == false` and both invoke `Disconnected`. In practice the
`DataConnectionActor` tolerates a duplicate `AdapterDisconnected` message, so impact
is low, but the comment overstates the guarantee. The same pattern exists in
`RealOpcUaClient.OnSessionKeepAlive` (`_connectionLostFired`).
**Recommendation**
Either make the guard atomic (`Interlocked.Exchange` with an `int` flag, or a lock),
or correct the comment to say "best-effort once-only; a duplicate event is possible
under a race and is tolerated downstream."
**Resolution**
_Unresolved._