f936f55f51
CD-015: rewrite NotificationOutboxRepository.InsertIfNotExistsAsync as raw-SQL IF NOT EXISTS … INSERT with SqlException 2601/2627 catch, ending the at-least-once livelock on the site→central notification handoff. DCL-018/019/020/021/022: add _subscribesInFlight guard so concurrent same-tag subscribes don't orphan an adapter handle; delete the latent dead _subscriptionHandles dictionary; stop double-counting _totalSubscribed when an unresolved tag is promoted via another instance; release adapter handles on mid-flight unsubscribe; gate the tag-resolution retry timer with IsTimerActive so subscribe bursts don't reset it into starvation. SR-020: add _terminatingActorsByName shadow so a third deploy arriving during a pending redeploy doesn't crash on InvalidActorNameException — displaced senders get a Failed/superseded response and the latest command wins on Terminated. SR-024: split OperationTrackingStore reads from writes (fresh SqliteConnection per GetStatusAsync) so long writes don't block status queries; rewrite Dispose to drop the sync-over-async bridge that could deadlock on a non-reentrant SyncContext; Interlocked.Exchange makes the dispose-once flag race-safe across both paths.
1270 lines
72 KiB
Markdown
1270 lines
72 KiB
Markdown
# Code Review — DataConnectionLayer
|
||
|
||
| Field | Value |
|
||
|-------|-------|
|
||
| Module | `src/ScadaLink.DataConnectionLayer` |
|
||
| Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
|
||
| Status | Reviewed |
|
||
| Last reviewed | 2026-05-28 |
|
||
| Reviewer | claude-agent |
|
||
| Commit reviewed | `1eb6e97` |
|
||
| Open findings | 5 |
|
||
|
||
## Summary
|
||
|
||
The DataConnectionLayer is a reasonably well-structured module: the Become/Stash
|
||
lifecycle state machine, the captured-`Self` marshalling of background-thread
|
||
disconnect events, and the protocol-factory abstraction all follow the design doc
|
||
and Akka.NET conventions. However, the review found one **critical** actor-model
|
||
violation — `HandleSubscribe` spawns a `Task.Run` that mutates the actor's private
|
||
dictionaries and counters from a thread-pool thread, racing with the actor's own
|
||
message loop. Several **high**-severity issues cluster around concurrency and error
|
||
handling: the subscription-failure path leaves the connection with degraded subtrees
|
||
but no real recovery, the `DataConnectionManagerActor`'s `Restart` supervision drops
|
||
all subscription state on a connection-actor crash, and `RealOpcUaClient`'s monitored-
|
||
item callback dictionary is mutated without synchronization while OPC UA notification
|
||
threads read it. The remaining findings concern stale health counters after failover,
|
||
an unused `WriteTimeout` option (writes are unbounded despite the design promising a
|
||
30 s timeout), `ReadBatchAsync` aborting mid-batch, and documentation drift between
|
||
the design doc's failover state machine and the implemented unstable-disconnect
|
||
heuristic. Test coverage is adequate for the happy paths and failover but absent for
|
||
tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.
|
||
|
||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||
|
||
The 2026-05-28 re-review walked all 10 checklist categories against the current
|
||
source and found **5 new findings**. All 17 prior findings remain `Resolved` and the
|
||
fixes (reverse-index unsubscribe, atomic disconnect guards, real-logger threading,
|
||
initial-connect failover, per-tag write-batch results, subscribe-response accuracy)
|
||
were verified in place. The new findings cluster around `HandleSubscribe` /
|
||
`HandleSubscribeCompleted` race-induced state drift:
|
||
|
||
- **High** — concurrent subscribes for the same tag from different instances each see
|
||
the tag as not-yet-subscribed (the `alreadySubscribed` snapshot was taken before
|
||
the Task.Run dispatch), so each Task.Run calls `_adapter.SubscribeAsync` and the
|
||
later `HandleSubscribeCompleted` silently discards the second adapter subscription
|
||
handle — the orphan never gets `UnsubscribeAsync`'d.
|
||
- **Medium** — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>`
|
||
mutated from thread-pool continuations of `SubscribeAsync` / `UnsubscribeAsync` /
|
||
`DisconnectAsync` running in parallel — the same class of bug DCL-003 fixed in
|
||
`RealOpcUaClient` but missed in the layer above.
|
||
- **Medium** — `HandleSubscribeCompleted`'s success branch never checks
|
||
`_unresolvedTags`, so a tag that previously failed resolution (incrementing
|
||
`_totalSubscribed`) and is then successfully subscribed by a different instance gets
|
||
`_totalSubscribed++` a second time, double-counting; meanwhile the unresolved entry
|
||
lingers until the retry timer also resolves it, creating an orphaned monitored item.
|
||
- **Medium** — when an instance is unsubscribed mid-flight,
|
||
`HandleSubscribeCompleted` re-creates an empty `_subscriptionsByInstance[name]`
|
||
entry and processes the late results, leaking `_tagSubscriberCount` /
|
||
`_totalSubscribed` / `_resolvedTags` increments for an instance with no
|
||
`_subscribers` entry to deliver values to.
|
||
- **Medium** — `HandleSubscribeCompleted` calls `Timers.StartPeriodicTimer` on every
|
||
completed subscribe with unresolved tags; in Akka.NET, `StartPeriodicTimer` with the
|
||
same key cancels and replaces the existing timer, so a burst of subscribes arriving
|
||
faster than `TagResolutionRetryInterval` (10 s default) keeps resetting the timer
|
||
and the retry never actually fires.
|
||
|
||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||
|
||
All 13 findings from the 2026-05-16 review remain `Resolved` and the fixes were
|
||
verified in place against the current source (`PipeTo(Self)` subscribe pattern,
|
||
`Resume` supervision, `ConcurrentDictionary` callback maps, atomic disconnect guards,
|
||
bounded write timeout, etc.). The re-review walked all 10 checklist categories again
|
||
and found **4 new findings**: one **High** — the DCL-012 security warning is never
|
||
seen in production because `RealOpcUaClientFactory.Create()` constructs
|
||
`RealOpcUaClient` with no logger, so the warning sinks into `NullLogger`; one
|
||
**Medium** — initial-connect failures in the `Connecting` state never count toward
|
||
failover, so a connection whose primary endpoint is unreachable at startup retries the
|
||
primary forever and never tries the configured backup; one **Medium** —
|
||
`HandleSubscribeCompleted` always replies `SubscribeTagsResponse(success: true)` even
|
||
when a connection-level subscribe failure is driving the actor into `Reconnecting`,
|
||
telling the Instance Actor the subscribe succeeded when it did not; and one **Low** —
|
||
`WriteBatchAsync` does not catch the `InvalidOperationException` from `EnsureConnected`,
|
||
so a mid-batch disconnect aborts the whole write batch (the same class of defect
|
||
DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
|
||
`DataConnectionLayer-014`.
|
||
|
||
## Checklist coverage (2026-05-28 re-review, commit `1eb6e97`)
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | x | Findings 020 (double-count `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance) and 021 (leaked `_subscriptionsByInstance` entry + counter increments when instance unsubscribes mid-flight). |
|
||
| 2 | Akka.NET conventions | x | Finding 022 — `Timers.StartPeriodicTimer` reset on every `HandleSubscribeCompleted` for unresolved tags can stall the retry timer indefinitely under a subscribe burst. |
|
||
| 3 | Concurrency & thread safety | x | Finding 018 — concurrent subscribes for the same tag from different instances each spawn an adapter subscription and the second handle is orphaned. Finding 019 — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from thread-pool continuations (same class of bug as DCL-003 one layer above). |
|
||
| 4 | Error handling & resilience | x | No new issues; DCL-004 / DCL-007 / DCL-015 / DCL-017 fixes verified in place. |
|
||
| 5 | Security | x | No new issues; DCL-012 / DCL-014 fixes verified. The Commons-side `OpcUaEndpointConfig.AutoAcceptUntrustedCerts = true` default surfaced in DCL-012 is still present but is out of this module's scope. |
|
||
| 6 | Performance & resource management | x | No new issues; DCL-008 reverse index verified. (Finding 018's orphaned adapter handle is logged under concurrency.) |
|
||
| 7 | Design-document adherence | x | No new issues. DCL-009's design-doc action (document unstable-disconnect failover trigger + configurable threshold) is still open at the doc level but out of this module's scope. |
|
||
| 8 | Code organization & conventions | x | No issues — POCOs in Commons, options class owned by component, factory + DI registration consistent. |
|
||
| 9 | Testing coverage | x | DCL001–017 regression tests present. Gaps remain for finding 018 (concurrent subscribe of same tag from two instances), 019 (concurrent `_subscriptionHandles` mutation), 020 (resolve-via-different-instance), 021 (unsubscribe-mid-flight), 022 (timer-reset starvation). |
|
||
| 10 | Documentation & comments | x | No new issues; DCL-013 atomic-guard XML comments verified. |
|
||
|
||
## Checklist coverage (2026-05-17 re-review, commit `39d737e`)
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | x | 2026-05-16 findings resolved. Re-review: finding 016 — `SubscribeTagsResponse` reports success on a connection-level subscribe failure. |
|
||
| 2 | Akka.NET conventions | x | 2026-05-16 findings resolved (`PipeTo(Self)` subscribe, `Resume` supervision). Re-review: no new issues. |
|
||
| 3 | Concurrency & thread safety | x | 2026-05-16 findings resolved (`ConcurrentDictionary`, atomic disconnect guards). Re-review: no new issues. |
|
||
| 4 | Error handling & resilience | x | Re-review: finding 015 — initial-connect failures never trigger failover; finding 017 — `WriteBatchAsync` aborts on mid-batch disconnect. |
|
||
| 5 | Security | x | Re-review: finding 014 — the DCL-012 auto-accept-cert warning is never logged in production (`RealOpcUaClient` built without a logger). |
|
||
| 6 | Performance & resource management | x | 2026-05-16 finding 008 resolved (reverse index). Re-review: no new issues. |
|
||
| 7 | Design-document adherence | x | 2026-05-16 findings 005/009 resolved. Re-review: no new issues (finding 015 logged under resilience). |
|
||
| 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. |
|
||
| 9 | Testing coverage | x | DCL001–013 regression tests present. Re-review: gaps remain for finding 014/015/016 scenarios (no test for production logger wiring, startup failover, or subscribe-response-on-failure). |
|
||
| 10 | Documentation & comments | x | 2026-05-16 finding 013 resolved. Re-review: no new issues. |
|
||
|
||
## Findings
|
||
|
||
### DataConnectionLayer-001 — `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Critical |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:473-538` |
|
||
|
||
**Description**
|
||
|
||
`HandleSubscribe` launches a `Task.Run(async () => ...)` that runs on a thread-pool
|
||
thread and directly mutates the actor's private mutable state: `instanceTags` (a
|
||
reference into `_subscriptionsByInstance`), `_subscriptionIds`, `_totalSubscribed`,
|
||
`_resolvedTags`, and `_unresolvedTags`. All of these are simultaneously read and
|
||
written by the actor's own message loop (`HandleTagValueReceived`, `HandleUnsubscribe`,
|
||
`ReSubscribeAll`, `HandleRetryTagResolution`, `ReplyWithHealthReport`). This is a
|
||
direct violation of the Akka.NET actor model, which guarantees single-threaded access
|
||
to actor state only when state is touched on the actor thread. Two concurrent
|
||
subscribe requests, or a subscribe overlapping a `TagValueReceived` / `GetHealthReport`,
|
||
produce data races on `Dictionary`/`HashSet`/`int` — `Dictionary` is not thread-safe
|
||
and concurrent mutation can corrupt internal buckets, throw, or lose entries. It can
|
||
also produce torn reads of the health counters.
|
||
|
||
**Recommendation**
|
||
|
||
Do not mutate actor state from the background task. Perform only the `await
|
||
_adapter.SubscribeAsync(...)` / `ReadAsync(...)` I/O in the task, collect the results
|
||
into a local immutable result object, and `PipeTo(Self)` an internal message (e.g.
|
||
`SubscribeCompleted`) whose handler — running on the actor thread — applies all state
|
||
mutations and counter updates. The response to `Sender` should be sent from that
|
||
handler too.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16. `HandleSubscribe` was restructured to follow the actor's own
|
||
`PipeTo(Self)` pattern (the one already used by `HandleRetryTagResolution`): the
|
||
background `Task.Run` now performs only adapter I/O (`SubscribeAsync`/`ReadAsync`),
|
||
collects per-tag outcomes into an immutable `SubscribeCompleted` message, and pipes
|
||
that to `Self`. All mutation of `_subscriptionIds`, `_subscriptionsByInstance`,
|
||
`_totalSubscribed`, `_resolvedTags` and `_unresolvedTags` now happens in the new
|
||
`HandleSubscribeCompleted` handler on the actor thread; it is wired into the
|
||
Connected, Connecting and Reconnecting states so an in-flight subscribe is applied
|
||
regardless of state transitions. Regression test
|
||
`DCL001_ConcurrentSubscribes_DoNotCorruptSubscriptionCounters` (30×30 concurrent
|
||
subscribes) fails against the pre-fix code and passes after. Fixed by the commit
|
||
whose message references `DataConnectionLayer-001`.
|
||
|
||
### DataConnectionLayer-002 — `Restart` supervision discards all subscription state on connection-actor crash
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Akka.NET conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionManagerActor.cs:131-141` |
|
||
|
||
**Description**
|
||
|
||
`DataConnectionManagerActor.SupervisorStrategy` returns a `OneForOneStrategy` with
|
||
`Directive.Restart` for `DataConnectionActor` failures. On restart, Akka.NET creates a
|
||
fresh actor instance, so all in-memory fields — `_subscriptionsByInstance`,
|
||
`_subscriptionIds`, `_subscribers`, `_unresolvedTags`, the quality counters — are
|
||
silently discarded. The actor re-enters `Connecting` with zero subscriptions, and the
|
||
design doc's "transparent re-subscribe" guarantee (WP-10) is broken: Instance Actors
|
||
that had subscribed before the crash never get their tags re-subscribed and will sit
|
||
at uncertain/stale quality indefinitely with no error returned. There is no durable
|
||
subscription store from which a restarted actor could rebuild state.
|
||
|
||
**Recommendation**
|
||
|
||
Either (a) make the subscription registry durable/recoverable so a restarted actor
|
||
can rebuild it (persist to local SQLite as the design doc says connection definitions
|
||
are, and have `PreStart` reload subscriptions), or (b) treat a connection-actor crash
|
||
as a lifecycle event the `DataConnectionManagerActor` notices, so it can re-issue the
|
||
subscription registrations. At minimum document that subscribers must re-register
|
||
after a crash and surface the lost-state condition rather than failing silently.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16. The `DataConnectionManagerActor.SupervisorStrategy` was changed
|
||
from `Directive.Restart` to `Directive.Resume` for `DataConnectionActor` failures.
|
||
`Resume` keeps the existing actor instance and all its in-memory subscription state
|
||
(`_subscriptionsByInstance`, `_subscriptionIds`, `_subscribers`, quality counters)
|
||
intact across a transient handler exception, so the design doc's "transparent
|
||
re-subscribe" guarantee (WP-10) is preserved. The actor is a long-lived stateful
|
||
coordinator and its own Become/Stash reconnect state machine already recovers
|
||
connection-level faults — it does not need a restart. This also aligns with the
|
||
ScadaLink convention of `Resume` for coordinator actors. Regression test
|
||
`DCL002_ConnectionActorCrash_PreservesSubscriptionState` crashes the connection actor
|
||
via a synchronously-throwing write and asserts the subscription survives (health
|
||
report still shows 1 subscribed/resolved tag); it fails against the pre-fix `Restart`
|
||
code and passes after. Fixed by the commit whose message references
|
||
`DataConnectionLayer-002` (commit `<pending>`).
|
||
|
||
### DataConnectionLayer-003 — `RealOpcUaClient` callback/monitored-item dictionaries mutated without synchronization
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:16-17,130-131,153,163,173,183-184` |
|
||
|
||
**Description**
|
||
|
||
`_monitoredItems` and `_callbacks` are plain `Dictionary<,>` instances. They are
|
||
written from `CreateSubscriptionAsync` / `RemoveSubscriptionAsync` (invoked from the
|
||
`DataConnectionActor`'s `Task.Run` / `ContinueWith` continuations, i.e. thread-pool
|
||
threads) and from `DisconnectAsync` (`.Clear()`), while being read concurrently from
|
||
the OPC Foundation SDK's `MonitoredItem.Notification` event handler, which fires on
|
||
the SDK's internal publish threads (`_callbacks.TryGetValue(handle, ...)` at line
|
||
163). Concurrent reads during a `Dictionary` resize or `Clear()` are undefined
|
||
behaviour — they can throw `InvalidOperationException`, return wrong entries, or
|
||
corrupt the dictionary. The `DataConnectionActor`'s subscribe path already runs off
|
||
the actor thread (finding 001), so multiple subscribe calls can also race each other
|
||
here.
|
||
|
||
**Recommendation**
|
||
|
||
Use `ConcurrentDictionary<,>` for `_monitoredItems` and `_callbacks`, or guard all
|
||
access with a lock. Note that fixing finding 001 (serialising subscribe through the
|
||
actor thread) reduces but does not eliminate the race, because the SDK notification
|
||
threads still read `_callbacks` concurrently with `RemoveSubscriptionAsync` /
|
||
`DisconnectAsync`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16. `_monitoredItems` and `_callbacks` in `RealOpcUaClient` were
|
||
changed from plain `Dictionary<,>` to `ConcurrentDictionary<,>`, and the two
|
||
`Remove(key)` call sites switched to `TryRemove`. This makes the maps safe to read
|
||
from the OPC Foundation SDK's publish threads (`MonitoredItem.Notification` reading
|
||
`_callbacks`) concurrently with subscribe/disconnect mutations on other threads.
|
||
`RealOpcUaClient` wraps concrete OPC Foundation SDK types (`ISession`,
|
||
`Subscription`, `MonitoredItem`) and cannot be exercised without a live OPC UA
|
||
server, so the regression is guarded structurally by
|
||
`DCL003_SharedDictionaryFields_AreConcurrentCollections` (a reflection test asserting
|
||
both fields are `ConcurrentDictionary<,>`); it fails against the pre-fix `Dictionary`
|
||
code and passes after. Fixed by the commit whose message references
|
||
`DataConnectionLayer-003` (commit `<pending>`).
|
||
|
||
### DataConnectionLayer-004 — Subscribe-time tag-resolution failure leaves the connection healthy but never recovers correctly
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:495-503,529-537` |
|
||
|
||
**Description**
|
||
|
||
When `_adapter.SubscribeAsync` throws inside the `HandleSubscribe` background task,
|
||
the catch block adds the tag to `_unresolvedTags` and increments `_totalSubscribed`,
|
||
treating every subscribe exception as a tag-resolution failure. But `SubscribeAsync`
|
||
also throws `InvalidOperationException` from `EnsureConnected()` when the OPC UA
|
||
client is not connected, and throws on transport faults — these are connection
|
||
problems, not bad tag paths. They get misclassified as unresolved tags and retried on
|
||
the 10 s tag-resolution timer instead of triggering the reconnection state machine.
|
||
Worse, the design doc (Tag Path Resolution, step 2) says the failed tag's attribute
|
||
must be marked quality `bad`; the code never pushes a bad-quality update to the
|
||
subscriber for a tag that fails to resolve at subscribe time, so the Instance Actor
|
||
stays at uncertain quality with no signal. The `TagResolutionFailed` message it sends
|
||
to `Self` only logs and re-arms the timer (`HandleTagResolutionFailed`).
|
||
|
||
**Recommendation**
|
||
|
||
Distinguish connection-level exceptions (raise `AdapterDisconnected` / let the
|
||
reconnect machine handle them) from genuine node-not-found errors. For genuine
|
||
resolution failures, push a `TagValueUpdate` with `QualityCode.Bad` to the subscribing
|
||
Instance Actor so it reflects the documented behaviour.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16. The subscribe background task now classifies each subscribe
|
||
exception via the new `IsConnectionLevelFailure` helper (`InvalidOperationException`
|
||
— thrown by `EnsureConnected()` — plus `SocketException`/`TimeoutException`/
|
||
`IOException` count as connection-level; anything else is a genuine resolution
|
||
failure). The classification is carried on `SubscribeTagResult.ConnectionLevelFailure`
|
||
and applied on the actor thread in `HandleSubscribeCompleted`: connection-level
|
||
failures no longer become unresolved tags and instead drive the reconnection state
|
||
machine (`HandleSubscribeCompleted` returns a flag and the Connected-state handler
|
||
calls `BecomeReconnecting`); genuine resolution failures still go to `_unresolvedTags`
|
||
and the retry timer, and now also push a `TagValueUpdate` with `QualityCode.Bad` to
|
||
the subscribing Instance Actor, matching the design doc's Tag Path Resolution step 2.
|
||
Regression tests `DCL004_GenuineTagResolutionFailure_PushesBadQualityToSubscriber`
|
||
and `DCL004_ConnectionLevelSubscribeFailure_TriggersReconnect_NotTagRetry` both fail
|
||
against the pre-fix code and pass after. Fixed by the commit whose message references
|
||
`DataConnectionLayer-004` (commit `<pending>`).
|
||
|
||
### DataConnectionLayer-005 — `WriteTimeout` option is documented and configured but never applied
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/DataConnectionOptions.cs:15`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:573-590` |
|
||
|
||
**Description**
|
||
|
||
`DataConnectionOptions.WriteTimeout` (default 30 s) and the design doc's "Shared
|
||
Settings" table both promise a bounded timeout for synchronous device writes. The
|
||
value is never read anywhere in the module (`grep` confirms only the declaration).
|
||
`HandleWrite` calls `_adapter.WriteAsync(request.TagPath, request.Value)` with no
|
||
`CancellationToken` and no timeout. If the OPC UA server hangs (TCP black-hole, no
|
||
RST), the write `Task` never completes, `PipeTo(sender)` never fires, and the calling
|
||
script's Ask blocks until its own ask-timeout — and the script gets no DCL-level
|
||
error. The design states write failures (including timeout) must be returned
|
||
synchronously to the script; an unbounded write violates that.
|
||
|
||
**Recommendation**
|
||
|
||
Create a `CancellationTokenSource(_options.WriteTimeout)`, pass its token to
|
||
`WriteAsync`, and in the continuation translate cancellation into a failed
|
||
`WriteTagResponse` with a timeout error message. Apply the same to the read used by
|
||
the initial-value seed and to `WriteBatchAndWaitAsync` paths if they are reachable.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16. `HandleWrite` now creates a `CancellationTokenSource(_options.WriteTimeout)`,
|
||
passes its token to `_adapter.WriteAsync(...)`, and disposes the source in the
|
||
continuation. A cancelled/timed-out write (`Task.IsCanceled` or a base
|
||
`OperationCanceledException`) is translated into a failed `WriteTagResponse` with a
|
||
`"Write timeout after Ns"` message, so a hung device write is bounded and the failure
|
||
is returned synchronously to the calling script (WP-11) instead of blocking until the
|
||
script's own Ask-timeout. (The `WriteBatchAndWaitAsync` adapter path already accepts
|
||
an explicit `timeout`/`CancellationToken` and is not invoked by `HandleWrite`, so no
|
||
change was needed there.) Regression test
|
||
`DCL005_Write_ThatHangs_TimesOutAndReturnsFailureSynchronously` uses an adapter whose
|
||
`WriteAsync` only completes when its token fires; it fails against the pre-fix
|
||
unbounded code and passes after. Fixed by the commit whose message references
|
||
`DataConnectionLayer-005` (commit `<pending>`).
|
||
|
||
### DataConnectionLayer-006 — Health quality counters not reset/recomputed after failover or re-subscribe
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:645-673,721-756` |
|
||
|
||
**Description**
|
||
|
||
`ReSubscribeAll` resets `_subscriptionIds`, `_unresolvedTags` and `_resolvedTags` to a
|
||
clean slate, but leaves `_lastTagQuality`, `_tagsGoodQuality`, `_tagsBadQuality` and
|
||
`_tagsUncertainQuality` untouched. `PushBadQualityForAllTags` (called on disconnect)
|
||
sets `_tagsBadQuality = _lastTagQuality.Count` and zeroes the others. After a
|
||
reconnect, `HandleTagValueReceived` decrements the *old* bucket using
|
||
`_lastTagQuality`'s value and increments the new one — but tags resolved for the first
|
||
time after reconnect were never in `_lastTagQuality`, so they only increment, never
|
||
decrement, and the totals can drift above `_totalSubscribed`. Over repeated
|
||
disconnect/reconnect cycles the health report's good/bad/uncertain counts become
|
||
unreliable.
|
||
|
||
**Verification note**: Confirmed against source. The root cause is broader than the
|
||
reconnect path the finding describes: `HandleUnsubscribe` also never removes a tag
|
||
from `_lastTagQuality` nor decrements its quality bucket, so an unsubscribed tag
|
||
lingers and `PushBadQualityForAllTags` (which sets `_tagsBadQuality =
|
||
_lastTagQuality.Count`) over-counts it — driving the bad-quality count above
|
||
`_totalSubscribed` even without a re-subscribe. Both the unsubscribe leak and the
|
||
re-subscribe drift are real.
|
||
|
||
**Recommendation**
|
||
|
||
On `BecomeConnected` after a re-subscribe (or in `ReSubscribeAll`), clear
|
||
`_lastTagQuality` and the three quality counters and let them be repopulated from
|
||
fresh `TagValueReceived` messages. Alternatively recompute the buckets from
|
||
`_lastTagQuality` whenever it changes rather than maintaining incremental counters.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). `HandleUnsubscribe` now removes each
|
||
unsubscribed tag from `_lastTagQuality` and decrements the corresponding quality
|
||
bucket, then reports the corrected counters via `UpdateTagQuality`/`UpdateTagResolution`;
|
||
`ReSubscribeAll` clears `_lastTagQuality` and zeroes the three quality counters so
|
||
post-reconnect tags are repopulated from fresh `TagValueReceived` messages instead of
|
||
only incrementing. Regression test
|
||
`DCL006_DisconnectAfterUnsubscribe_BadQualityCountMatchesRemainingTags` subscribes two
|
||
tags, pushes Good values, unsubscribes one, then disconnects and asserts
|
||
`PushBadQualityForAllTags` reports exactly 1 bad tag (the reconnect is gated open so
|
||
`ReSubscribeAll` does not run before the assertion); it reports 2 against the pre-fix
|
||
code and 1 after.
|
||
|
||
### DataConnectionLayer-007 — `ReadBatchAsync` aborts the whole batch on the first failing tag
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:187-195` |
|
||
|
||
**Description**
|
||
|
||
`ReadBatchAsync` loops calling `ReadAsync` per tag. `ReadAsync` re-throws any
|
||
non-cancellation exception (line 184). So if any single tag in the batch throws (bad
|
||
node, transient fault), the entire `ReadBatchAsync` throws and the caller gets no
|
||
results for the tags that *did* read successfully — even though `ReadResult` already
|
||
has a `Success`/`ErrorMessage` shape designed to carry per-tag failures. The batch is
|
||
also fully serial (one round-trip per tag), defeating the point of a batch API; the
|
||
design doc lists `ReadBatch`/`WriteBatch` as first-class operations.
|
||
|
||
**Verification note**: Confirmed against source — `ReadAsync` re-throws on any
|
||
non-`OperationCanceledException`, aborting the whole batch.
|
||
|
||
**Recommendation**
|
||
|
||
Catch per-tag exceptions inside the loop and store a failed `ReadResult` for that tag
|
||
so the batch returns a complete map. Ideally issue a single OPC UA `Read` service call
|
||
for all node IDs (`RealOpcUaClient.ReadValueAsync` already builds a
|
||
`ReadValueIdCollection` — extend it to accept multiple nodes).
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). `ReadBatchAsync` now wraps each per-tag
|
||
`ReadAsync` in a try/catch: a per-tag exception is recorded as a failed `ReadResult`
|
||
(`Success: false`, message = the exception message) so the batch returns a complete
|
||
result map for every requested tag; `OperationCanceledException` is still propagated
|
||
so a cancelled batch aborts as a whole. The per-tag-serial loop and single-service-call
|
||
optimisation were deliberately left for a follow-up — they are a performance concern,
|
||
not the correctness bug this finding raised. Regression test
|
||
`DCL007_ReadBatch_ReturnsPerTagResults_WhenOneTagFails` reads three tags where the
|
||
middle one throws and asserts all three appear in the result map with the failing one
|
||
marked unsuccessful; it threw (no map returned) against the pre-fix code and passes
|
||
after.
|
||
|
||
### DataConnectionLayer-008 — `HandleUnsubscribe` is O(n^2) over instances and rechecks `_unresolvedTags` redundantly
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Performance & resource management |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:540-569` |
|
||
|
||
**Description**
|
||
|
||
For each tag of the instance being removed, `HandleUnsubscribe` scans every other
|
||
instance's tag set (`_subscriptionsByInstance.Where(...).Any()`), making the operation
|
||
O(tags x instances). On a site with many instances sharing a connection this is
|
||
needlessly expensive on every instance stop/redeploy. Separately, line 562
|
||
re-evaluates `!_unresolvedTags.Contains(tagPath)` immediately after line 561 already
|
||
removed `tagPath` from `_unresolvedTags`, so the condition is always true — dead
|
||
logic that obscures intent (the decrement of `_resolvedTags` is unconditional in
|
||
practice).
|
||
|
||
**Recommendation**
|
||
|
||
Maintain a reference count per tag path (or a `tagPath -> set<instance>` reverse index)
|
||
so the "any other subscriber" check is O(1). Remove the redundant `_unresolvedTags`
|
||
re-check or restructure so the resolved/unresolved decrement reflects the tag's actual
|
||
prior state captured before removal.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). A `tagPath → subscriber-count` reverse index
|
||
(`_tagSubscriberCount`) was added: `HandleSubscribeCompleted` increments it whenever a
|
||
tag is newly added to an instance's set, and `HandleUnsubscribe` decrements it,
|
||
releasing a tag at the adapter only when the count reaches zero. The "any other
|
||
subscriber" check is now O(1) per tag instead of an O(instances) `Where(...).Any()`
|
||
scan. The redundant `!_unresolvedTags.Contains(tagPath)` re-check (always true after
|
||
the unconditional `Remove` on the line above) was removed — the surviving branch is
|
||
entered only for tags that have a subscription id, which are by definition resolved,
|
||
so `_resolvedTags--` is now unconditional with an explanatory comment. The cleanup
|
||
also fixed a latent leak the original code could not reach: an unresolved tag whose
|
||
last subscriber unsubscribes is now removed from `_unresolvedTags`/`_resolutionInFlight`
|
||
and decremented from `_totalSubscribed` (previously it lingered in the retry timer and
|
||
the subscribed total forever). Regression test
|
||
`DCL008_Unsubscribe_OnlyReleasesTagWhenLastSubscriberLeaves` subscribes a tag from two
|
||
instances plus an exclusive tag, then unsubscribes each instance and asserts the
|
||
shared tag is released at the adapter only after the last subscriber leaves and the
|
||
health counters track correctly. (This finding is a performance refactor, not a
|
||
correctness bug — the pre-fix `Where(...).Any()` logic was functionally correct, so
|
||
the test passes against both versions and serves as a behavioural guard for the
|
||
refactor.)
|
||
|
||
### DataConnectionLayer-009 — Implemented failover heuristic diverges from the documented state machine
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium — partially design-doc work outside this module's editable scope |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:189,242-297,379-449`, `docs/requirements/Component-DataConnectionLayer.md:73-85` |
|
||
|
||
**Description**
|
||
|
||
The design doc's failover state machine reads "retry active endpoint (5s) -> N failures
|
||
(>= FailoverRetryCount) -> switch to other endpoint". The code implements two *separate*
|
||
failover triggers: (a) `HandleReconnectResult` counts `_consecutiveFailures` on
|
||
connect-attempt failures (matches the doc), and (b) `BecomeReconnecting` additionally
|
||
counts `_consecutiveUnstableDisconnects` — connections that succeeded but dropped
|
||
within a hard-coded 60 s `StableConnectionThreshold` — and fails over on that count
|
||
too. The unstable-disconnect path, the 60 s threshold, and the fact that failover can
|
||
happen on *successful-but-flaky* connections are not described in the component doc at
|
||
all. A reviewer or operator reading `Component-DataConnectionLayer.md` would not
|
||
predict this behaviour, and the 60 s threshold is a magic constant not exposed via
|
||
`DataConnectionOptions`.
|
||
|
||
**Verification note**: Confirmed against source. The hard-coded
|
||
`StableConnectionThreshold = TimeSpan.FromSeconds(60)` `static readonly` field and the
|
||
`_consecutiveUnstableDisconnects` failover path both exist as described.
|
||
|
||
**Recommendation**
|
||
|
||
Update `Component-DataConnectionLayer.md` to document the unstable-disconnect failover
|
||
path and the stability threshold, and move the 60 s threshold into
|
||
`DataConnectionOptions` so it is configurable and consistent with the other tunables.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). The configurability half of the recommendation
|
||
is done: the hard-coded `StableConnectionThreshold` constant was removed from
|
||
`DataConnectionActor` and replaced with a new `DataConnectionOptions.StableConnectionThreshold`
|
||
property (60 s default), bindable from the `DataConnectionLayer` `appsettings.json`
|
||
section like `ReconnectInterval`/`TagResolutionRetryInterval`/`WriteTimeout`. Regression
|
||
test `DCL009_StableConnectionThreshold_IsConfigurable_WithSixtySecondDefault` guards
|
||
the default and the setter. **The documentation half is out of this module's editable
|
||
scope** — `docs/requirements/Component-DataConnectionLayer.md` (lines 73-85) still
|
||
describes only the connect-failure failover path and does not mention the
|
||
unstable-disconnect trigger. **Action required (surfaced):** the DCL design doc should
|
||
be updated to document the unstable-disconnect failover path and the configurable
|
||
stability threshold; that edit was deliberately not made here because this task is
|
||
scoped to `src/ScadaLink.DataConnectionLayer`, tests, and this findings file only.
|
||
|
||
### DataConnectionLayer-010 — Tag-resolution retry can issue duplicate concurrent subscribe attempts
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:594-619,689-703` |
|
||
|
||
**Description**
|
||
|
||
`HandleRetryTagResolution` fires `SubscribeAsync` for every tag in `_unresolvedTags`
|
||
via `ContinueWith(...).PipeTo(self)`, but does **not** remove the tags from
|
||
`_unresolvedTags` while the attempts are in flight. Because tags are not removed
|
||
before the retry, a slow `SubscribeAsync` overlapping the next 10 s tick issues
|
||
duplicate concurrent subscribe attempts for the same tag, which can create duplicate
|
||
monitored items / leaked subscription IDs (the second success overwrites
|
||
`_subscriptionIds[tag]` in `HandleTagResolutionSucceeded`, orphaning the first handle
|
||
with no `UnsubscribeAsync` call). The timer-cancel condition in
|
||
`HandleTagResolutionSucceeded` is also non-deterministic for the same reason.
|
||
|
||
**Verification note**: Confirmed against source — `HandleRetryTagResolution` dispatched
|
||
`SubscribeAsync` for every tag in `_unresolvedTags` on every tick with no in-flight
|
||
guard.
|
||
|
||
**Recommendation**
|
||
|
||
Remove tags from `_unresolvedTags` (into an "in-flight" set) when a retry is
|
||
dispatched, and only put them back on failure. This prevents overlapping duplicate
|
||
subscribe attempts and makes the timer-cancel condition deterministic.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). A new `_resolutionInFlight` `HashSet<string>`
|
||
tracks tags whose retry `SubscribeAsync` is currently outstanding.
|
||
`HandleRetryTagResolution` now dispatches only for unresolved tags **not** already in
|
||
flight (and skips entirely if all are in flight), adding each dispatched tag to the
|
||
set; `HandleTagResolutionSucceeded` and `HandleTagResolutionFailed` remove the tag
|
||
from the set when its attempt completes, and `HandleUnsubscribe`/`ReSubscribeAll`
|
||
clear stale entries. This prevents overlapping duplicate subscribe attempts and the
|
||
resulting orphaned monitored items. Regression test
|
||
`DCL010_TagResolutionRetry_DoesNotIssueDuplicateConcurrentSubscribes` gives a tag a
|
||
genuine initial failure then a retry `SubscribeAsync` that never completes, lets six
|
||
100 ms retry ticks elapse, and asserts exactly one retry was dispatched (2 total
|
||
subscribe calls); the pre-fix code dispatched on every tick (6 total).
|
||
|
||
### DataConnectionLayer-011 — Stale subscription callbacks from disposed adapters can still reach the actor
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:486-489,278-285,416-425`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:252-262` |
|
||
|
||
**Description**
|
||
|
||
On failover the actor disposes the old adapter (`_adapter.DisposeAsync()`,
|
||
fire-and-forget) and creates a fresh one. The old adapter's subscription callbacks
|
||
captured `self` and `tagPath` and `Tell` `TagValueReceived` to the actor. While the
|
||
`Reconnecting` handler ignores `TagValueReceived` (line 334), once the actor reaches
|
||
`Connected` again it processes them — and a disposed adapter whose OPC UA SDK threads
|
||
have not yet fully torn down could still deliver a value, mixing pre-failover device
|
||
data with the new endpoint's data and briefly reporting a value the active endpoint
|
||
never produced. There is no per-adapter generation/epoch tag on `TagValueReceived` to
|
||
distinguish current from stale callbacks.
|
||
|
||
**Verification note**: Confirmed against source — `TagValueReceived` carried no
|
||
adapter identity, and `HandleTagValueReceived` (reachable in `Connected`) processed
|
||
any such message regardless of which adapter produced it.
|
||
|
||
**Recommendation**
|
||
|
||
Add an adapter-generation counter incremented on every adapter swap; stamp it onto
|
||
`TagValueReceived` (captured in the callback closure) and drop messages whose
|
||
generation does not match the current adapter in `HandleTagValueReceived`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Implemented exactly as recommended: a new
|
||
`_adapterGeneration` `int` field is incremented at both adapter-swap sites (the
|
||
unstable-disconnect failover in `BecomeReconnecting` and the connect-failure failover
|
||
in `HandleReconnectResult`). The `TagValueReceived` record gained an
|
||
`AdapterGeneration` field; every subscription callback closure (`HandleSubscribe`, the
|
||
initial-read seed, `HandleRetryTagResolution`, `ReSubscribeAll`) captures the
|
||
generation in effect at subscribe time and stamps it onto each `TagValueReceived`.
|
||
`HandleTagValueReceived` drops any message whose generation no longer matches the
|
||
current adapter, so a callback fired by a disposed adapter after failover cannot reach
|
||
an Instance Actor. Regression test
|
||
`DCL011_StaleTagValueFromOldAdapter_IsNotForwardedAfterFailover` subscribes on the
|
||
primary, fails over to the backup, then invokes the captured primary callback with a
|
||
stale value and asserts the subscriber receives nothing; the stale value reached the
|
||
subscriber against the pre-fix code and is dropped after.
|
||
|
||
### DataConnectionLayer-012 — `AutoAcceptUntrustedCerts` defaults to `true`, accepting any server certificate
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium — full secure default also requires a Commons + design-doc change outside this module |
|
||
| Category | Security |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/IOpcUaClient.cs:17`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:49,60-61`, `docs/requirements/Component-DataConnectionLayer.md:116` |
|
||
|
||
**Description**
|
||
|
||
`OpcUaConnectionOptions.AutoAcceptUntrustedCerts` defaults to `true`, and
|
||
`RealOpcUaClient.ConnectAsync` wires `CertificateValidator.CertificateValidation += (_, e) => e.Accept = true`
|
||
when it is set. With the default, every server certificate is accepted unconditionally
|
||
— there is no certificate-pinning or trust-store enforcement — which defeats the
|
||
`Sign`/`SignAndEncrypt` security modes against an active man-in-the-middle on the OPC
|
||
UA link. The design doc explicitly lists `true` as the default. For an industrial
|
||
control link this is a meaningful exposure; a secure-by-default posture would reject
|
||
untrusted certs unless an operator opts in per connection.
|
||
|
||
**Verification note**: Confirmed against source. Note the *authoritative* runtime
|
||
default does not actually live on `OpcUaConnectionOptions` — for a real connection
|
||
`OpcUaDataConnection.ConnectAsync` builds `OpcUaConnectionOptions` from
|
||
`OpcUaEndpointConfig` (in `ScadaLink.Commons`), whose `AutoAcceptUntrustedCerts`
|
||
property also defaults to `true`. `OpcUaConnectionOptions`' own default is only the
|
||
fallback used when an `OpcUaConnectionOptions` is constructed directly.
|
||
|
||
**Recommendation**
|
||
|
||
Default `AutoAcceptUntrustedCerts` to `false` and require explicit per-connection
|
||
opt-in, or at minimum log a prominent warning whenever the auto-accept validator is
|
||
installed. Update the design doc to reflect the secure default.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). The two in-scope parts of the recommendation
|
||
are done: (1) `OpcUaConnectionOptions.AutoAcceptUntrustedCerts` now defaults to
|
||
`false`; (2) `RealOpcUaClient.ConnectAsync` logs a prominent `ILogger` warning
|
||
whenever the auto-accept certificate validator is installed (an `ILogger<RealOpcUaClient>`
|
||
was added as an optional constructor parameter, defaulting to `NullLogger`, so
|
||
existing callers are unaffected). Regression test
|
||
`DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse` guards the
|
||
new secure default. **Two parts remain outside this module's editable scope and are
|
||
surfaced as action required:** (a) `ScadaLink.Commons.Types.DataConnections.OpcUaEndpointConfig.AutoAcceptUntrustedCerts`
|
||
still defaults to `true` — since that is the value actually used for a real connection
|
||
(see verification note above), the Commons default must also be flipped to `false`
|
||
for the system to be secure-by-default; (b) `docs/requirements/Component-DataConnectionLayer.md`
|
||
line 116 still documents `true` as the default and must be updated. Both edits were
|
||
deliberately not made here because this task is scoped to
|
||
`src/ScadaLink.DataConnectionLayer`, tests, and this findings file only.
|
||
|
||
### DataConnectionLayer-013 — Misleading XML comment: `RaiseDisconnected` claims thread safety it does not provide
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Documentation & comments |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:270-281` |
|
||
|
||
**Description**
|
||
|
||
The XML doc on `RaiseDisconnected` states "Thread-safe: only the first caller triggers
|
||
the event." The implementation is a non-atomic check-then-set on a `volatile bool`
|
||
(`if (_disconnectFired) return; _disconnectFired = true;`). `volatile` guarantees
|
||
visibility, not atomicity — two threads (e.g. the OPC UA keep-alive thread via
|
||
`OnClientConnectionLost` and a `ReadAsync` failure path) can both observe
|
||
`_disconnectFired == false` and both invoke `Disconnected`. In practice the
|
||
`DataConnectionActor` tolerates a duplicate `AdapterDisconnected` message, so impact
|
||
is low, but the comment overstates the guarantee. The same pattern exists in
|
||
`RealOpcUaClient.OnSessionKeepAlive` (`_connectionLostFired`).
|
||
|
||
**Recommendation**
|
||
|
||
Either make the guard atomic (`Interlocked.Exchange` with an `int` flag, or a lock),
|
||
or correct the comment to say "best-effort once-only; a duplicate event is possible
|
||
under a race and is tolerated downstream."
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Rather than weaken the XML comment to match the
|
||
weak guard, the guard was made genuinely atomic so the documented "only the first
|
||
caller fires the event" guarantee becomes true. `OpcUaDataConnection._disconnectFired`
|
||
and `RealOpcUaClient._connectionLostFired` were changed from `volatile bool` to `int`,
|
||
and the check-then-set in `RaiseDisconnected` / `OnSessionKeepAlive` replaced with a
|
||
single `Interlocked.Exchange(ref flag, 1) != 0` compare-and-set; the reset on connect
|
||
uses `Interlocked.Exchange(ref flag, 0)`. The XML comments on both methods were updated
|
||
to describe the atomic compare-and-set explicitly. Regression test
|
||
`DCL013_ConcurrentConnectionLost_RaisesDisconnectedExactlyOnce` runs 25 rounds, each
|
||
fanning 32 barrier-synchronised threads that raise the client's `ConnectionLost` event
|
||
simultaneously, and asserts `Disconnected` fires exactly once per round; against a
|
||
non-atomic check-then-set it double-fires (verified by temporarily reverting the
|
||
guard), and it passes against the atomic fix.
|
||
|
||
### DataConnectionLayer-014 — DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Security |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:325`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:35-39,79-83` |
|
||
|
||
**Description**
|
||
|
||
Finding DataConnectionLayer-012 was resolved in part by adding a prominent
|
||
`ILogger` warning in `RealOpcUaClient.ConnectAsync` whenever the auto-accept
|
||
certificate validator is installed (`RealOpcUaClient.cs:79-83`). The
|
||
`ILogger<RealOpcUaClient>` constructor parameter was made optional, defaulting to
|
||
`NullLogger<RealOpcUaClient>.Instance` (`RealOpcUaClient.cs:35-39`).
|
||
|
||
However, the only production code path that constructs a `RealOpcUaClient` is
|
||
`RealOpcUaClientFactory.Create()` (`RealOpcUaClient.cs:325`), which calls
|
||
`new RealOpcUaClient(_globalOptions)` and passes **no logger**. The factory itself
|
||
holds only an `OpcUaGlobalOptions` and has no `ILoggerFactory`/`ILogger` available.
|
||
As a result the `_logger` field is always `NullLogger` for every real OPC UA
|
||
connection, and the man-in-the-middle warning the DCL-012 fix added is silently
|
||
discarded. An operator who deploys a connection with `AutoAcceptUntrustedCerts`
|
||
enabled — accepting any server certificate on an industrial control link — gets no
|
||
visible signal anywhere in the logs. The in-scope half of DCL-012's resolution is
|
||
therefore not actually effective in production; only the unit test
|
||
(`DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse`, which only
|
||
checks the default value) passes.
|
||
|
||
**Recommendation**
|
||
|
||
Thread a real logger through to `RealOpcUaClient`. `DataConnectionFactory` already
|
||
holds an `ILoggerFactory` and constructs `RealOpcUaClientFactory(globalOptions)` —
|
||
give `RealOpcUaClientFactory` an `ILoggerFactory` (or an `ILogger<RealOpcUaClient>`)
|
||
constructor parameter and pass `_loggerFactory.CreateLogger<RealOpcUaClient>()` into
|
||
each `new RealOpcUaClient(...)`. Add a test that asserts the warning is emitted on a
|
||
real connect with auto-accept enabled (e.g. via a captured `ILogger`), not just that
|
||
the default is `false`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17 (commit pending). `RealOpcUaClientFactory` gained an
|
||
`ILoggerFactory` constructor parameter and `Create()` now passes
|
||
`_loggerFactory.CreateLogger<RealOpcUaClient>()` into every `RealOpcUaClient` it builds;
|
||
`DataConnectionFactory` (which already holds an `ILoggerFactory`) now constructs
|
||
`RealOpcUaClientFactory(globalOptions, _loggerFactory)`, so the DCL-012 auto-accept-cert
|
||
warning reaches a real logger in production instead of being discarded by `NullLogger`.
|
||
The parameterless / single-arg factory constructors still default to `NullLoggerFactory`
|
||
so existing callers are unaffected. Regression tests
|
||
`DCL014_RealOpcUaClientFactory_CreatesClientWithRealLogger` and
|
||
`DCL014_DataConnectionFactory_ThreadsLoggerToRealOpcUaClient` fail against the pre-fix
|
||
code (the first fails to compile — no logger ctor; the second observes a `NullLogger`)
|
||
and pass after.
|
||
|
||
### DataConnectionLayer-015 — Initial-connect failures never trigger failover; an unreachable primary at startup never tries the backup
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:404-417`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:419-493` |
|
||
|
||
**Description**
|
||
|
||
Failover between the primary and backup endpoints is implemented in two places, both
|
||
reachable only after a connection has already been `Connected` at least once:
|
||
`HandleReconnectResult` (Reconnecting state) counts `_consecutiveFailures` and switches
|
||
endpoint, and `BecomeReconnecting` counts `_consecutiveUnstableDisconnects`.
|
||
|
||
`HandleConnectResult` — the handler for the *initial* connection attempt in the
|
||
`Connecting` state (`DataConnectionActor.cs:404-417`) — does neither. On failure it
|
||
only logs and re-arms the reconnect timer with `AttemptConnect`; it never increments
|
||
`_consecutiveFailures`, never consults `_backupConfig`, and never switches endpoint.
|
||
|
||
Consequence: if the primary endpoint is unreachable when the connection actor first
|
||
starts — which is the common case after a fresh artifact deployment, a site restart,
|
||
or a primary that is simply down at that moment — the actor retries the *primary*
|
||
endpoint indefinitely at `ReconnectInterval` and **never** attempts the configured
|
||
backup. The design doc's endpoint-redundancy promise ("automatic failover when the
|
||
active endpoint becomes unreachable") is silently not honoured for the
|
||
never-connected-yet case, and an operator sees a connection stuck `Connecting` forever
|
||
despite a healthy backup being configured.
|
||
|
||
**Recommendation**
|
||
|
||
Make `HandleConnectResult` participate in the failover counter the same way
|
||
`HandleReconnectResult` does: increment `_consecutiveFailures` on failure and, when
|
||
`_backupConfig != null && _consecutiveFailures >= _failoverRetryCount`, perform the
|
||
endpoint switch (dispose adapter, create the other adapter, bump `_adapterGeneration`,
|
||
log the failover event) before re-arming the timer. Alternatively, fold the initial
|
||
connect into the same reconnect path so there is a single failover decision point. Add
|
||
a regression test for "primary down at startup, backup configured → fails over to
|
||
backup".
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17 (commit pending). The endpoint-switch failover logic was extracted
|
||
from `HandleReconnectResult` into a shared `CountFailureAndMaybeFailover` helper, and
|
||
`HandleConnectResult` (the initial-connect handler in the `Connecting` state) now
|
||
increments `_consecutiveFailures` and calls that helper on failure — so a primary that
|
||
is unreachable at startup fails over to the configured backup after
|
||
`FailoverRetryCount` attempts instead of retrying the primary forever. Single-endpoint
|
||
connections (no backup) still retry indefinitely. Regression test
|
||
`DCL015_PrimaryDownAtStartup_FailsOverToBackup` fails against the pre-fix code (the
|
||
backup adapter is never created) and passes after; `DCL015_SingleEndpointDownAtStartup_RetriesIndefinitely_NoFailover`
|
||
guards the no-backup path.
|
||
|
||
### DataConnectionLayer-016 — `HandleSubscribeCompleted` reports `SubscribeTagsResponse` success even on a connection-level subscribe failure
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:606,666-672`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:232-240` |
|
||
|
||
**Description**
|
||
|
||
`HandleSubscribeCompleted` computes `connectionLevelFailure` (line 606) and returns it
|
||
so the `Connected`-state handler can drive the actor into `Reconnecting`
|
||
(`DataConnectionActor.cs:232-240`). But before returning, it unconditionally replies
|
||
to the caller with `new SubscribeTagsResponse(..., true, null, ...)` (lines 666-667) —
|
||
`Success: true`, `Error: null` — regardless of whether any tag failed at connection
|
||
level.
|
||
|
||
So when a subscribe arrives while the adapter is silently down, the Instance Actor is
|
||
told the subscribe **succeeded**, while the connection actor simultaneously transitions
|
||
to `Reconnecting`. The tags were never actually subscribed at the adapter (the catch
|
||
block recorded `Success: false`); they are recovered later by `ReSubscribeAll` only if
|
||
and when reconnection succeeds. The caller has no way to distinguish "subscribed and
|
||
healthy" from "accepted, but the connection is currently down" — a misleading
|
||
success signal on a request that did not do what the response claims.
|
||
|
||
(Genuine tag-resolution failures are arguably also reported as overall `true`, but
|
||
that is defensible: those tags are tracked in `_unresolvedTags` and the design models
|
||
unresolved tags as a runtime quality concern, with a `Bad`-quality `TagValueUpdate`
|
||
already pushed. The connection-level case is the clear defect because the actor itself
|
||
treats it as a failure worth a state transition.)
|
||
|
||
**Recommendation**
|
||
|
||
When `connectionLevelFailure` is true, reply with
|
||
`SubscribeTagsResponse(..., success: false, error: "connection unavailable — will
|
||
re-subscribe on reconnect", ...)` (or an equivalent), so the caller's response matches
|
||
the actor's own assessment. Optionally carry per-tag outcomes in the response so the
|
||
Instance Actor can reflect partial success. Add a test asserting the response is not
|
||
`Success: true` when a connection-level subscribe failure drives `Reconnecting`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17 (commit pending). `HandleSubscribeCompleted` now replies
|
||
`SubscribeTagsResponse(success: false, error: "connection unavailable — will
|
||
re-subscribe on reconnect")` when `connectionLevelFailure` is true — matching the
|
||
actor's own assessment as it drives into `Reconnecting` — instead of unconditionally
|
||
replying `success: true`. Genuine tag-resolution failures still reply `success: true`
|
||
(they are a runtime quality concern tracked via `_unresolvedTags`, with a `Bad`-quality
|
||
`TagValueUpdate` already pushed), so that case is not regressed. Regression test
|
||
`DCL016_ConnectionLevelSubscribeFailure_RepliesWithUnsuccessfulResponse` fails against
|
||
the pre-fix code (it returned `Success: true`) and passes after;
|
||
`DCL016_GenuineResolutionFailure_StillRepliesSuccess` guards the resolution-failure path.
|
||
|
||
### DataConnectionLayer-017 — `WriteBatchAsync` aborts the whole batch on a mid-batch disconnect
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:229-237`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:218-227` |
|
||
|
||
**Description**
|
||
|
||
`WriteBatchAsync` loops calling `WriteAsync` per tag (`OpcUaDataConnection.cs:229-237`).
|
||
`WriteAsync` returns a `WriteResult` for OPC-UA-level write rejections (good — a bad
|
||
status does not abort the batch), but it first calls `EnsureConnected()`
|
||
(`OpcUaDataConnection.cs:220`), which throws `InvalidOperationException` when the
|
||
client is disconnected. `WriteBatchAsync` does not catch that exception, so if the
|
||
connection drops partway through a batch the whole `WriteBatchAsync` throws and the
|
||
caller gets no result map — losing the per-tag outcomes for the tags that already
|
||
wrote. This is the same class of defect that DataConnectionLayer-007 fixed for
|
||
`ReadBatchAsync` (which now records a failed `ReadResult` per failing tag and only
|
||
propagates `OperationCanceledException`). `WriteBatchAsync` feeds
|
||
`WriteBatchAndWaitAsync` (line 246), so a disconnect during a flag-and-wait write
|
||
sequence surfaces as an unhandled exception rather than a clean `false`/per-tag result.
|
||
|
||
Severity is Low because device writes are real-time control operations with no
|
||
store-and-forward, the batch write paths are not on the primary `HandleWrite` hot path
|
||
(`HandleWrite` calls single-tag `WriteAsync`), and a disconnect mid-batch is itself an
|
||
error condition — but the inconsistent error shape (exception vs. per-tag result) is a
|
||
maintainability and correctness wart.
|
||
|
||
**Recommendation**
|
||
|
||
Mirror the DCL-007 fix: wrap the per-tag `WriteAsync` call in `WriteBatchAsync` in a
|
||
try/catch that records a failed `WriteResult(false, ex.Message)` for the failing tag
|
||
and continues, while still propagating `OperationCanceledException` to abort a
|
||
cancelled batch as a whole. This gives callers (including `WriteBatchAndWaitAsync`) a
|
||
complete, consistent result map.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17 (commit pending). `WriteBatchAsync` now wraps each per-tag
|
||
`WriteAsync` call in a try/catch — mirroring the DCL-007 fix for `ReadBatchAsync` — so a
|
||
mid-batch fault (the `InvalidOperationException` from `EnsureConnected` on a dropped
|
||
connection) is recorded as a failed `WriteResult(false, ex.Message)` for that tag and
|
||
the batch returns a complete result map; `OperationCanceledException` is still
|
||
propagated so a cancelled batch aborts as a whole. Callers (including
|
||
`WriteBatchAndWaitAsync`) now get a consistent per-tag result shape instead of an
|
||
unhandled exception. Regression test
|
||
`DCL017_WriteBatch_ReturnsPerTagResults_WhenConnectionDropsMidBatch` fails against the
|
||
pre-fix code (the batch throws, no map returned) and passes after;
|
||
`DCL017_WriteBatch_CancellationAbortsWholeBatch` guards that cancellation still aborts.
|
||
|
||
### DataConnectionLayer-018 — Concurrent subscribes for the same tag from different instances orphan an adapter subscription handle
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:557,564-594,653` |
|
||
|
||
**Resolution** — added a `_subscribesInFlight` HashSet mirroring the
|
||
existing `_resolutionInFlight` pattern. `HandleSubscribe` now partitions
|
||
each request's tags on the actor thread into "this request will
|
||
SubscribeAsync" vs. "already subscribed by us OR by another in-flight
|
||
request"; the second arrival sees the tag in `_subscribesInFlight` and
|
||
treats it as `AlreadySubscribed: true` without issuing a duplicate
|
||
adapter call. `HandleSubscribeCompleted` removes each
|
||
non-AlreadySubscribed result from the set; `ReSubscribeAll` clears it on
|
||
reconnect. Regression test
|
||
`DCL018_ConcurrentSubscribes_SameTag_DifferentInstances_IssueOneAdapterSubscribe`
|
||
parks the first subscribe in flight, fires a second for the same tag,
|
||
and asserts exactly one adapter SubscribeAsync call.
|
||
|
||
**Description**
|
||
|
||
`HandleSubscribe` snapshots `_subscriptionIds.Keys` into a local `alreadySubscribed`
|
||
set on the actor thread before dispatching the `Task.Run` that performs the adapter
|
||
I/O (line 557). The snapshot is the only basis on which the background task decides
|
||
whether to call `_adapter.SubscribeAsync` — and it is taken **once**, before the I/O
|
||
runs.
|
||
|
||
If two `SubscribeTagsRequest` messages arrive on the actor thread for different
|
||
instances that both reference the same tag path, both `HandleSubscribe` invocations
|
||
take a snapshot at a time when neither subscribe has completed, so `alreadySubscribed`
|
||
does not contain the shared tag in either snapshot. Both background tasks then call
|
||
`_adapter.SubscribeAsync(tagPath, ...)`, the adapter creates **two** monitored items
|
||
and returns two distinct subscription ids, and each task pipes a `SubscribeCompleted`
|
||
back to the actor with `AlreadySubscribed: false, Success: true`.
|
||
|
||
`HandleSubscribeCompleted` for the first message takes the success branch and writes
|
||
`_subscriptionIds[tagPath] = subId1`. The second message arrives, hits the
|
||
"already in `_subscriptionIds`" guard at line 653 (`_subscriptionIds.ContainsKey(...)`)
|
||
and `continue`s — but `result.SubscriptionId` (the orphan handle for the second
|
||
adapter subscription) is silently discarded. The orphan monitored item stays alive in
|
||
the OPC UA session for the lifetime of the adapter, sending duplicate data-change
|
||
notifications (whose callbacks were stamped with the captured `generation`) into
|
||
`HandleTagValueReceived` for every value change. Across a deploy that creates many
|
||
instances sharing a few tags, this leaks N-1 monitored items per shared tag and
|
||
doubles/triples the per-tag publish traffic.
|
||
|
||
DCL-010 fixed an analogous duplicate-dispatch bug for the tag-resolution retry path
|
||
via `_resolutionInFlight`; the equivalent guard is missing on the user-subscribe
|
||
path.
|
||
|
||
**Recommendation**
|
||
|
||
Track in-flight subscribes the same way DCL-010 tracks in-flight retries: maintain a
|
||
`HashSet<string> _subscribesInFlight` and add `tagPath` to it on the actor thread
|
||
**before** the `Task.Run` dispatch, only for tags not already in
|
||
`_subscriptionIds` and not already in `_subscribesInFlight`. Tags that are already
|
||
in flight should produce a `SubscribeTagResult(..., AlreadySubscribed: true, ...)`
|
||
without touching the adapter. Remove from `_subscribesInFlight` in
|
||
`HandleSubscribeCompleted` once the result is applied. Add a regression test that
|
||
fans two simultaneous `SubscribeTagsRequest` messages for the same tag and asserts
|
||
exactly one `_adapter.SubscribeAsync(tag, ...)` call (and no orphan subscription id).
|
||
|
||
### DataConnectionLayer-019 — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from concurrent thread-pool continuations
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:31,167,177`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:163-164` |
|
||
|
||
**Resolution** — deleted the dead `_subscriptionHandles` field outright.
|
||
Subscription bookkeeping lives in `RealOpcUaClient._monitoredItems` /
|
||
`_callbacks` (already `ConcurrentDictionary` per DCL-003) at the device
|
||
layer and `DataConnectionActor._subscriptionIds` at the actor layer;
|
||
the adapter had no live reader and the field was a latent
|
||
race-condition trap. Added structural regression test
|
||
`DCL019_OpcUaDataConnection_HasNoNonConcurrentSharedDictionary` that
|
||
reflects over the adapter's fields and fails if any plain
|
||
`Dictionary<,>` is reintroduced.
|
||
|
||
**Description**
|
||
|
||
`OpcUaDataConnection._subscriptionHandles` is declared as `Dictionary<string,
|
||
string>`. It is mutated from:
|
||
|
||
- `SubscribeAsync` (line 167): `_subscriptionHandles[subscriptionId] = tagPath;`
|
||
after an `await _client!.CreateSubscriptionAsync(...)` — i.e. the assignment
|
||
executes on the continuation thread (a thread-pool thread).
|
||
- `UnsubscribeAsync` (line 177): `_subscriptionHandles.Remove(subscriptionId);`
|
||
similarly after an `await`.
|
||
- `DisconnectAsync` indirectly via the underlying `_client.DisconnectAsync` does
|
||
**not** touch `_subscriptionHandles`, but multiple `SubscribeAsync` /
|
||
`UnsubscribeAsync` calls can run in parallel from the upper layer.
|
||
|
||
The DCL upper layer calls `_adapter.SubscribeAsync` from multiple places that all
|
||
run off the actor thread:
|
||
|
||
- `DataConnectionActor.HandleSubscribe` inside its `Task.Run` (multiple invocations
|
||
can run in parallel — see DCL-018);
|
||
- `HandleRetryTagResolution` issues `_adapter.SubscribeAsync` for every tag in
|
||
`_unresolvedTags` and pipes the continuation (each subscribe runs concurrently
|
||
via the SDK's async machinery);
|
||
- `ReSubscribeAll` does the same after a reconnect.
|
||
|
||
So plain-`Dictionary` mutations occur on multiple thread-pool threads concurrently —
|
||
the exact pattern DCL-003 fixed by switching `RealOpcUaClient._monitoredItems` and
|
||
`_callbacks` to `ConcurrentDictionary<,>`. Plain `Dictionary` mutations during a
|
||
concurrent resize are undefined behaviour: they can throw
|
||
`InvalidOperationException`, corrupt the internal hash buckets, or lose entries.
|
||
|
||
This is `_subscriptionHandles` is currently dead state (the dictionary is written to
|
||
and `Remove`d but **never read**), so a corruption today would not crash the
|
||
subscribe path — but the bug is latent and the field will become load-bearing the
|
||
moment any code reads it (e.g., to expose a subscription-id-to-tag-path lookup for
|
||
diagnostics, which is what the dictionary's name suggests it was intended for).
|
||
|
||
**Recommendation**
|
||
|
||
Either (a) change `_subscriptionHandles` to
|
||
`ConcurrentDictionary<string, string>` and use `TryAdd` / `TryRemove`, mirroring
|
||
DCL-003's fix one layer down, or (b) delete the field entirely since it is never
|
||
read — the bookkeeping is fully owned by `RealOpcUaClient._monitoredItems` /
|
||
`_callbacks` and `DataConnectionActor._subscriptionIds`. Removing it eliminates the
|
||
race and removes dead state in one stroke. Add a regression test (or extend
|
||
`DCL003_SharedDictionaryFields_AreConcurrentCollections`) that asserts no
|
||
non-concurrent `Dictionary` field is shared across thread boundaries in adapter
|
||
state.
|
||
|
||
### DataConnectionLayer-020 — `HandleSubscribeCompleted` double-counts `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance's subscribe
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:653-661,670-688` |
|
||
|
||
**Resolution** — split the success branch into "fresh subscribe" vs
|
||
"unresolved → resolved promotion": `_unresolvedTags.Remove(...)` is now
|
||
called before incrementing the counters; a promotion increments only
|
||
`_resolvedTags` and clears `_resolutionInFlight`, mirroring
|
||
`HandleTagResolutionSucceeded`. The symmetric failure branch was also
|
||
fixed — `_totalSubscribed` only increments when
|
||
`_unresolvedTags.Add(...)` returns `true` so a second instance failing
|
||
to resolve the same tag is a no-op on the counter. Tests:
|
||
`DCL020_UnresolvedTagPromoted_ByDifferentInstance_DoesNotDoubleCountTotalSubscribed`
|
||
and `DCL020_TwoInstancesFailingSameTag_OnlyCountsTagOnceInTotal`.
|
||
|
||
**Description**
|
||
|
||
`HandleSubscribeCompleted`'s success branch (line 656-661) writes
|
||
`_subscriptionIds[result.TagPath] = result.SubscriptionId!; _totalSubscribed++;
|
||
_resolvedTags++;`. The guard at line 653 only skips when the tag is already in
|
||
`_subscriptionIds`; it does **not** check `_unresolvedTags`. So the success branch
|
||
runs for a tag that previously failed resolution from an earlier instance's subscribe
|
||
(which incremented `_totalSubscribed` and added the tag to `_unresolvedTags` at line
|
||
674-676) and is now successfully subscribed by a later instance.
|
||
|
||
Sequence:
|
||
|
||
1. Instance A subscribes "Tag1". `_adapter.SubscribeAsync` throws a non-connection-level
|
||
exception. `HandleSubscribeCompleted` takes the resolution-failure branch:
|
||
`_unresolvedTags.Add("Tag1"); _totalSubscribed++;` (now 1).
|
||
2. The device finishes booting. Instance B subscribes "Tag1". `_adapter.SubscribeAsync`
|
||
succeeds, returning `subId`. `HandleSubscribeCompleted` takes the success branch:
|
||
`_subscriptionIds["Tag1"] = subId; _totalSubscribed++; _resolvedTags++;`
|
||
(now `_totalSubscribed = 2`, `_resolvedTags = 1`).
|
||
3. `_unresolvedTags` still contains "Tag1". The retry timer fires next tick,
|
||
`HandleRetryTagResolution` dispatches `SubscribeAsync("Tag1", ...)` against the
|
||
adapter (creating a **second** monitored item for the same tag), and
|
||
`HandleTagResolutionSucceeded` runs `_unresolvedTags.Remove("Tag1")` →
|
||
`_subscriptionIds["Tag1"] = newSubId` (overwriting Instance B's id, orphaning that
|
||
monitored item) → `_resolvedTags++` (now 2, matching `_totalSubscribed`).
|
||
|
||
Net effect:
|
||
|
||
- `_totalSubscribed` is over-counted by 1 from step 2 until step 3 reconciles
|
||
`_resolvedTags`. During that window the health report's "subscribed / resolved"
|
||
ratio is wrong.
|
||
- Two adapter subscription handles for the same tag are leaked across this race
|
||
(DCL-018's orphan plus the retry's second adapter call); the second leaks
|
||
permanently because `_subscriptionIds["Tag1"]` only stores the most recent id.
|
||
|
||
**Recommendation**
|
||
|
||
In `HandleSubscribeCompleted`'s success branch, before the `_totalSubscribed++`,
|
||
check `_unresolvedTags.Remove(result.TagPath)` — if the tag was already counted as
|
||
unresolved, promote it without re-incrementing `_totalSubscribed` (mirror
|
||
`HandleTagResolutionSucceeded`'s shape: only increment `_resolvedTags`,
|
||
`_subscriptionIds[tag] = subId`, and clear `_resolutionInFlight`). Add a regression
|
||
test that asserts `_totalSubscribed` / `_resolvedTags` consistency after the
|
||
"resolve via a second instance" sequence.
|
||
|
||
### DataConnectionLayer-021 — `HandleSubscribeCompleted` re-creates and leaks `_subscriptionsByInstance` entry when the instance unsubscribed mid-flight
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:626-634,642-687` |
|
||
|
||
**Resolution** — `HandleSubscribeCompleted` now detects the
|
||
mid-termination race: when `_subscriptionsByInstance.TryGetValue` fails,
|
||
the method logs a warning, clears each owned tag from
|
||
`_subscribesInFlight`, fires fire-and-forget
|
||
`_adapter.UnsubscribeAsync(result.SubscriptionId!)` for every successful
|
||
non-AlreadySubscribed result (so the OPC UA monitored items don't keep
|
||
streaming for a tag nobody listens to), and returns without applying
|
||
counter or handle mutations. Regression test
|
||
`DCL021_UnsubscribeDuringInFlightSubscribe_ReleasesAdapterHandle_AndKeepsStateClean`
|
||
parks the subscribe, sends `UnsubscribeTagsRequest`, then releases the
|
||
subscribe and asserts `_adapter.UnsubscribeAsync` is called and
|
||
`TotalSubscribedTags == 0`.
|
||
|
||
**Description**
|
||
|
||
`HandleSubscribe` dispatches a `Task.Run` that performs adapter I/O off the actor
|
||
thread and pipes a `SubscribeCompleted` back. If an `UnsubscribeTagsRequest` for the
|
||
same instance is processed on the actor thread between dispatch and completion,
|
||
`HandleUnsubscribe` removes the instance from both `_subscriptionsByInstance` and
|
||
`_subscribers`. When the late `SubscribeCompleted` arrives,
|
||
`HandleSubscribeCompleted` (line 629-634) **re-creates** the
|
||
`_subscriptionsByInstance[instanceName] = new HashSet<string>()` entry and proceeds
|
||
to apply the results — but `_subscribers[instanceName]` was already removed by the
|
||
unsubscribe and is **not** re-added.
|
||
|
||
Consequences:
|
||
|
||
1. `_subscriptionsByInstance` keeps a permanently-leaked entry for an instance that
|
||
no longer exists. `ReSubscribeAll` derives its tag list from
|
||
`_subscriptionsByInstance.Values` and will keep re-subscribing the leaked tags on
|
||
every future reconnect.
|
||
2. For each tag, `_tagSubscriberCount[tagPath]` is incremented (line 647-649), so the
|
||
reverse index treats the leaked instance as a real subscriber. The only way to
|
||
drop the count is another `HandleUnsubscribe` for the same instance — which can
|
||
never arrive because the Instance Actor that owned the instance is gone.
|
||
3. The success branch increments `_totalSubscribed` / `_resolvedTags` (or
|
||
`_unresolvedTags` for genuine resolution failures), drifting health counters
|
||
permanently above the actual subscribed instance count.
|
||
4. Subsequent `HandleTagValueReceived` fanout iterates `_subscriptionsByInstance` and
|
||
skips this entry via the `_subscribers.TryGetValue` check (line 1019), so values
|
||
are silently dropped — but the work of fanning them out (the iteration and the
|
||
tag lookup) is still done for every value update on every leaked tag, forever.
|
||
5. The genuine-resolution-failure path at line 682-686 (`subscriber.Tell(new
|
||
TagValueUpdate(..., QualityCode.Bad, ...))`) also silently no-ops because
|
||
`_subscribers.TryGetValue` is false — so the design doc's "push bad quality on
|
||
resolution failure" promise is broken for this case (a minor, edge-case wrinkle).
|
||
|
||
**Recommendation**
|
||
|
||
In `HandleSubscribeCompleted`, when `_subscriptionsByInstance.TryGetValue` fails,
|
||
treat the result as obsolete: log it and `return` without re-creating the entry or
|
||
applying any state mutations. Any successfully-created adapter subscriptions in
|
||
`msg.Results` should be cleaned up — iterate the results and
|
||
`_adapter.UnsubscribeAsync(result.SubscriptionId!)` (fire-and-forget) for each
|
||
successful one so the orphan handles do not leak in the adapter. Add a regression
|
||
test that subscribes from instance A, immediately sends an `UnsubscribeTagsRequest`
|
||
for A while the subscribe I/O is in flight, completes the subscribe, and asserts
|
||
`_subscriptionsByInstance`, `_tagSubscriberCount` and health counters are all clean.
|
||
|
||
### DataConnectionLayer-022 — `HandleSubscribeCompleted` and `HandleTagResolutionFailed` reset the tag-resolution retry timer on every call via `StartPeriodicTimer`, starving the retry under subscribe bursts
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Akka.NET conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:691-698,991-998` |
|
||
|
||
**Resolution** — both call sites now gate
|
||
`Timers.StartPeriodicTimer("tag-resolution-retry", ...)` with
|
||
`!Timers.IsTimerActive("tag-resolution-retry")` so the first failure
|
||
arms the timer and subsequent failures only pile onto `_unresolvedTags`.
|
||
Regression test
|
||
`DCL022_BurstedFailedSubscribes_DoNotResetRetryTimer` fires 5 failed
|
||
subscribes within one retry interval and asserts the retry timer
|
||
actually fires within one interval of the first failure (pre-fix it
|
||
would have been pushed past the interval boundary by the resets).
|
||
|
||
**Description**
|
||
|
||
`HandleSubscribeCompleted` (line 691-698) and `HandleTagResolutionFailed` (line
|
||
991-998) both call:
|
||
|
||
```
|
||
Timers.StartPeriodicTimer(
|
||
"tag-resolution-retry",
|
||
new RetryTagResolution(),
|
||
_options.TagResolutionRetryInterval,
|
||
_options.TagResolutionRetryInterval);
|
||
```
|
||
|
||
`Akka.Actor.ITimerScheduler.StartPeriodicTimer(key, ...)` cancels and replaces any
|
||
existing timer registered under the same key. So every additional subscribe (or
|
||
every additional tag-resolution failure) that produces unresolved tags **resets** the
|
||
retry timer's countdown to the full interval — the timer never accumulates
|
||
elapsed time across calls.
|
||
|
||
With the default `TagResolutionRetryInterval = 10s`, an instance-startup burst that
|
||
produces a new `SubscribeTagsRequest` every 5s (a not-unusual cadence during
|
||
deployment fan-out) will keep cancelling the not-yet-fired retry every 5s, so the
|
||
"periodic" retry never actually fires until subscribes go quiet. In a steady-state
|
||
site with many instances deploying together this can delay tag resolution by tens
|
||
of seconds, leaving attributes at `Bad` quality longer than the documented retry
|
||
interval implies.
|
||
|
||
**Recommendation**
|
||
|
||
Start the periodic timer once, when the actor first transitions to having
|
||
non-empty `_unresolvedTags`, and only re-start it after `Timers.Cancel(...)` has
|
||
been called (e.g., when the actor enters `Reconnecting`). The cleanest pattern is to
|
||
gate the start with `if (!Timers.IsTimerActive("tag-resolution-retry"))` before
|
||
calling `StartPeriodicTimer` — `IsTimerActive` is on `ITimerScheduler`. Apply the
|
||
same gate at both call sites. Add a regression test that fires 5 subscribes with
|
||
unresolved tags within one retry interval and asserts the retry fires at most one
|
||
interval after the first failure (not after the fifth subscribe).
|