fix(data-connection): resolve DataConnectionLayer-006..012 — quality-counter reconciliation, per-tag batch reads, configurable failover threshold, dedup retry, stale-callback guard, secure cert default

This commit is contained in:
Joseph Doherty
2026-05-16 21:11:24 -04:00
parent 0c82ffcbe6
commit c9b236e507
8 changed files with 515 additions and 34 deletions

View File

@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 8 |
| Open findings | 2 |
## Summary
@@ -287,7 +287,7 @@ unbounded code and passes after. Fixed by the commit whose message references
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:645-673,721-756` |
**Description**
@@ -303,6 +303,14 @@ decrement, and the totals can drift above `_totalSubscribed`. Over repeated
disconnect/reconnect cycles the health report's good/bad/uncertain counts become
unreliable.
**Verification note**: Confirmed against source. The root cause is broader than the
reconnect path the finding describes: `HandleUnsubscribe` also never removes a tag
from `_lastTagQuality` nor decrements its quality bucket, so an unsubscribed tag
lingers and `PushBadQualityForAllTags` (which sets `_tagsBadQuality =
_lastTagQuality.Count`) over-counts it — driving the bad-quality count above
`_totalSubscribed` even without a re-subscribe. Both the unsubscribe leak and the
re-subscribe drift are real.
**Recommendation**
On `BecomeConnected` after a re-subscribe (or in `ReSubscribeAll`), clear
@@ -312,7 +320,17 @@ fresh `TagValueReceived` messages. Alternatively recompute the buckets from
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending). `HandleUnsubscribe` now removes each
unsubscribed tag from `_lastTagQuality` and decrements the corresponding quality
bucket, then reports the corrected counters via `UpdateTagQuality`/`UpdateTagResolution`;
`ReSubscribeAll` clears `_lastTagQuality` and zeroes the three quality counters so
post-reconnect tags are repopulated from fresh `TagValueReceived` messages instead of
only incrementing. Regression test
`DCL006_DisconnectAfterUnsubscribe_BadQualityCountMatchesRemainingTags` subscribes two
tags, pushes Good values, unsubscribes one, then disconnects and asserts
`PushBadQualityForAllTags` reports exactly 1 bad tag (the reconnect is gated open so
`ReSubscribeAll` does not run before the assertion); it reports 2 against the pre-fix
code and 1 after.
### DataConnectionLayer-007 — `ReadBatchAsync` aborts the whole batch on the first failing tag
@@ -320,7 +338,7 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:187-195` |
**Description**
@@ -333,6 +351,9 @@ has a `Success`/`ErrorMessage` shape designed to carry per-tag failures. The bat
also fully serial (one round-trip per tag), defeating the point of a batch API; the
design doc lists `ReadBatch`/`WriteBatch` as first-class operations.
**Verification note**: Confirmed against source — `ReadAsync` re-throws on any
non-`OperationCanceledException`, aborting the whole batch.
**Recommendation**
Catch per-tag exceptions inside the loop and store a failed `ReadResult` for that tag
@@ -342,7 +363,17 @@ for all node IDs (`RealOpcUaClient.ReadValueAsync` already builds a
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending). `ReadBatchAsync` now wraps each per-tag
`ReadAsync` in a try/catch: a per-tag exception is recorded as a failed `ReadResult`
(`Success: false`, message = the exception message) so the batch returns a complete
result map for every requested tag; `OperationCanceledException` is still propagated
so a cancelled batch aborts as a whole. The per-tag-serial loop and single-service-call
optimisation were deliberately left for a follow-up — they are a performance concern,
not the correctness bug this finding raised. Regression test
`DCL007_ReadBatch_ReturnsPerTagResults_WhenOneTagFails` reads three tags where the
middle one throws and asserts all three appear in the result map with the failing one
marked unsuccessful; it threw (no map returned) against the pre-fix code and passes
after.
### DataConnectionLayer-008 — `HandleUnsubscribe` is O(n^2) over instances and rechecks `_unresolvedTags` redundantly
@@ -379,9 +410,9 @@ _Unresolved._
| | |
|--|--|
| Severity | Medium |
| Severity | Medium — partially design-doc work outside this module's editable scope |
| Category | Design-document adherence |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:189,242-297,379-449`, `docs/requirements/Component-DataConnectionLayer.md:73-85` |
**Description**
@@ -398,6 +429,10 @@ all. A reviewer or operator reading `Component-DataConnectionLayer.md` would not
predict this behaviour, and the 60 s threshold is a magic constant not exposed via
`DataConnectionOptions`.
**Verification note**: Confirmed against source. The hard-coded
`StableConnectionThreshold = TimeSpan.FromSeconds(60)` `static readonly` field and the
`_consecutiveUnstableDisconnects` failover path both exist as described.
**Recommendation**
Update `Component-DataConnectionLayer.md` to document the unstable-disconnect failover
@@ -406,7 +441,19 @@ path and the stability threshold, and move the 60 s threshold into
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending). The configurability half of the recommendation
is done: the hard-coded `StableConnectionThreshold` constant was removed from
`DataConnectionActor` and replaced with a new `DataConnectionOptions.StableConnectionThreshold`
property (60 s default), bindable from the `DataConnectionLayer` `appsettings.json`
section like `ReconnectInterval`/`TagResolutionRetryInterval`/`WriteTimeout`. Regression
test `DCL009_StableConnectionThreshold_IsConfigurable_WithSixtySecondDefault` guards
the default and the setter. **The documentation half is out of this module's editable
scope** — `docs/requirements/Component-DataConnectionLayer.md` (lines 73-85) still
describes only the connect-failure failover path and does not mention the
unstable-disconnect trigger. **Action required (surfaced):** the DCL design doc should
be updated to document the unstable-disconnect failover path and the configurable
stability threshold; that edit was deliberately not made here because this task is
scoped to `src/ScadaLink.DataConnectionLayer`, tests, and this findings file only.
### DataConnectionLayer-010 — Tag-resolution retry can issue duplicate concurrent subscribe attempts
@@ -414,7 +461,7 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:594-619,689-703` |
**Description**
@@ -429,6 +476,10 @@ monitored items / leaked subscription IDs (the second success overwrites
with no `UnsubscribeAsync` call). The timer-cancel condition in
`HandleTagResolutionSucceeded` is also non-deterministic for the same reason.
**Verification note**: Confirmed against source — `HandleRetryTagResolution` dispatched
`SubscribeAsync` for every tag in `_unresolvedTags` on every tick with no in-flight
guard.
**Recommendation**
Remove tags from `_unresolvedTags` (into an "in-flight" set) when a retry is
@@ -437,7 +488,18 @@ subscribe attempts and makes the timer-cancel condition deterministic.
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending). A new `_resolutionInFlight` `HashSet<string>`
tracks tags whose retry `SubscribeAsync` is currently outstanding.
`HandleRetryTagResolution` now dispatches only for unresolved tags **not** already in
flight (and skips entirely if all are in flight), adding each dispatched tag to the
set; `HandleTagResolutionSucceeded` and `HandleTagResolutionFailed` remove the tag
from the set when its attempt completes, and `HandleUnsubscribe`/`ReSubscribeAll`
clear stale entries. This prevents overlapping duplicate subscribe attempts and the
resulting orphaned monitored items. Regression test
`DCL010_TagResolutionRetry_DoesNotIssueDuplicateConcurrentSubscribes` gives a tag a
genuine initial failure then a retry `SubscribeAsync` that never completes, lets six
100 ms retry ticks elapse, and asserts exactly one retry was dispatched (2 total
subscribe calls); the pre-fix code dispatched on every tick (6 total).
### DataConnectionLayer-011 — Stale subscription callbacks from disposed adapters can still reach the actor
@@ -445,7 +507,7 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:486-489,278-285,416-425`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:252-262` |
**Description**
@@ -460,6 +522,10 @@ data with the new endpoint's data and briefly reporting a value the active endpo
never produced. There is no per-adapter generation/epoch tag on `TagValueReceived` to
distinguish current from stale callbacks.
**Verification note**: Confirmed against source — `TagValueReceived` carried no
adapter identity, and `HandleTagValueReceived` (reachable in `Connected`) processed
any such message regardless of which adapter produced it.
**Recommendation**
Add an adapter-generation counter incremented on every adapter swap; stamp it onto
@@ -468,15 +534,28 @@ generation does not match the current adapter in `HandleTagValueReceived`.
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending). Implemented exactly as recommended: a new
`_adapterGeneration` `int` field is incremented at both adapter-swap sites (the
unstable-disconnect failover in `BecomeReconnecting` and the connect-failure failover
in `HandleReconnectResult`). The `TagValueReceived` record gained an
`AdapterGeneration` field; every subscription callback closure (`HandleSubscribe`, the
initial-read seed, `HandleRetryTagResolution`, `ReSubscribeAll`) captures the
generation in effect at subscribe time and stamps it onto each `TagValueReceived`.
`HandleTagValueReceived` drops any message whose generation no longer matches the
current adapter, so a callback fired by a disposed adapter after failover cannot reach
an Instance Actor. Regression test
`DCL011_StaleTagValueFromOldAdapter_IsNotForwardedAfterFailover` subscribes on the
primary, fails over to the backup, then invokes the captured primary callback with a
stale value and asserts the subscriber receives nothing; the stale value reached the
subscriber against the pre-fix code and is dropped after.
### DataConnectionLayer-012 — `AutoAcceptUntrustedCerts` defaults to `true`, accepting any server certificate
| | |
|--|--|
| Severity | Medium |
| Severity | Medium — full secure default also requires a Commons + design-doc change outside this module |
| Category | Security |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/IOpcUaClient.cs:17`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:49,60-61`, `docs/requirements/Component-DataConnectionLayer.md:116` |
**Description**
@@ -490,6 +569,13 @@ UA link. The design doc explicitly lists `true` as the default. For an industria
control link this is a meaningful exposure; a secure-by-default posture would reject
untrusted certs unless an operator opts in per connection.
**Verification note**: Confirmed against source. Note the *authoritative* runtime
default does not actually live on `OpcUaConnectionOptions` — for a real connection
`OpcUaDataConnection.ConnectAsync` builds `OpcUaConnectionOptions` from
`OpcUaEndpointConfig` (in `ScadaLink.Commons`), whose `AutoAcceptUntrustedCerts`
property also defaults to `true`. `OpcUaConnectionOptions`' own default is only the
fallback used when an `OpcUaConnectionOptions` is constructed directly.
**Recommendation**
Default `AutoAcceptUntrustedCerts` to `false` and require explicit per-connection
@@ -498,7 +584,21 @@ installed. Update the design doc to reflect the secure default.
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending). The two in-scope parts of the recommendation
are done: (1) `OpcUaConnectionOptions.AutoAcceptUntrustedCerts` now defaults to
`false`; (2) `RealOpcUaClient.ConnectAsync` logs a prominent `ILogger` warning
whenever the auto-accept certificate validator is installed (an `ILogger<RealOpcUaClient>`
was added as an optional constructor parameter, defaulting to `NullLogger`, so
existing callers are unaffected). Regression test
`DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse` guards the
new secure default. **Two parts remain outside this module's editable scope and are
surfaced as action required:** (a) `ScadaLink.Commons.Types.DataConnections.OpcUaEndpointConfig.AutoAcceptUntrustedCerts`
still defaults to `true` — since that is the value actually used for a real connection
(see verification note above), the Commons default must also be flipped to `false`
for the system to be secure-by-default; (b) `docs/requirements/Component-DataConnectionLayer.md`
line 116 still documents `true` as the default and must be updated. Both edits were
deliberately not made here because this task is scoped to
`src/ScadaLink.DataConnectionLayer`, tests, and this findings file only.
### DataConnectionLayer-013 — Misleading XML comment: `RaiseDisconnected` claims thread safety it does not provide