fix(data-connection-layer): resolve DataConnectionLayer-002/003/004/005 — Resume supervision, concurrent dicts, subscribe-failure classification, write timeout
This commit is contained in:
@@ -8,7 +8,7 @@
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 12 |
|
||||
| Open findings | 8 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -101,7 +101,7 @@ whose message references `DataConnectionLayer-001`.
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Akka.NET conventions |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionManagerActor.cs:131-141` |
|
||||
|
||||
**Description**
|
||||
@@ -127,7 +127,20 @@ after a crash and surface the lost-state condition rather than failing silently.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16. The `DataConnectionManagerActor.SupervisorStrategy` was changed
|
||||
from `Directive.Restart` to `Directive.Resume` for `DataConnectionActor` failures.
|
||||
`Resume` keeps the existing actor instance and all its in-memory subscription state
|
||||
(`_subscriptionsByInstance`, `_subscriptionIds`, `_subscribers`, quality counters)
|
||||
intact across a transient handler exception, so the design doc's "transparent
|
||||
re-subscribe" guarantee (WP-10) is preserved. The actor is a long-lived stateful
|
||||
coordinator and its own Become/Stash reconnect state machine already recovers
|
||||
connection-level faults — it does not need a restart. This also aligns with the
|
||||
ScadaLink convention of `Resume` for coordinator actors. Regression test
|
||||
`DCL002_ConnectionActorCrash_PreservesSubscriptionState` crashes the connection actor
|
||||
via a synchronously-throwing write and asserts the subscription survives (health
|
||||
report still shows 1 subscribed/resolved tag); it fails against the pre-fix `Restart`
|
||||
code and passes after. Fixed by the commit whose message references
|
||||
`DataConnectionLayer-002` (commit `<pending>`).
|
||||
|
||||
### DataConnectionLayer-003 — `RealOpcUaClient` callback/monitored-item dictionaries mutated without synchronization
|
||||
|
||||
@@ -135,7 +148,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:16-17,130-131,153,163,173,183-184` |
|
||||
|
||||
**Description**
|
||||
@@ -162,7 +175,18 @@ threads still read `_callbacks` concurrently with `RemoveSubscriptionAsync` /
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16. `_monitoredItems` and `_callbacks` in `RealOpcUaClient` were
|
||||
changed from plain `Dictionary<,>` to `ConcurrentDictionary<,>`, and the two
|
||||
`Remove(key)` call sites switched to `TryRemove`. This makes the maps safe to read
|
||||
from the OPC Foundation SDK's publish threads (`MonitoredItem.Notification` reading
|
||||
`_callbacks`) concurrently with subscribe/disconnect mutations on other threads.
|
||||
`RealOpcUaClient` wraps concrete OPC Foundation SDK types (`ISession`,
|
||||
`Subscription`, `MonitoredItem`) and cannot be exercised without a live OPC UA
|
||||
server, so the regression is guarded structurally by
|
||||
`DCL003_SharedDictionaryFields_AreConcurrentCollections` (a reflection test asserting
|
||||
both fields are `ConcurrentDictionary<,>`); it fails against the pre-fix `Dictionary`
|
||||
code and passes after. Fixed by the commit whose message references
|
||||
`DataConnectionLayer-003` (commit `<pending>`).
|
||||
|
||||
### DataConnectionLayer-004 — Subscribe-time tag-resolution failure leaves the connection healthy but never recovers correctly
|
||||
|
||||
@@ -170,7 +194,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:495-503,529-537` |
|
||||
|
||||
**Description**
|
||||
@@ -197,7 +221,21 @@ Instance Actor so it reflects the documented behaviour.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16. The subscribe background task now classifies each subscribe
|
||||
exception via the new `IsConnectionLevelFailure` helper (`InvalidOperationException`
|
||||
— thrown by `EnsureConnected()` — plus `SocketException`/`TimeoutException`/
|
||||
`IOException` count as connection-level; anything else is a genuine resolution
|
||||
failure). The classification is carried on `SubscribeTagResult.ConnectionLevelFailure`
|
||||
and applied on the actor thread in `HandleSubscribeCompleted`: connection-level
|
||||
failures no longer become unresolved tags and instead drive the reconnection state
|
||||
machine (`HandleSubscribeCompleted` returns a flag and the Connected-state handler
|
||||
calls `BecomeReconnecting`); genuine resolution failures still go to `_unresolvedTags`
|
||||
and the retry timer, and now also push a `TagValueUpdate` with `QualityCode.Bad` to
|
||||
the subscribing Instance Actor, matching the design doc's Tag Path Resolution step 2.
|
||||
Regression tests `DCL004_GenuineTagResolutionFailure_PushesBadQualityToSubscriber`
|
||||
and `DCL004_ConnectionLevelSubscribeFailure_TriggersReconnect_NotTagRetry` both fail
|
||||
against the pre-fix code and pass after. Fixed by the commit whose message references
|
||||
`DataConnectionLayer-004` (commit `<pending>`).
|
||||
|
||||
### DataConnectionLayer-005 — `WriteTimeout` option is documented and configured but never applied
|
||||
|
||||
@@ -205,7 +243,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.DataConnectionLayer/DataConnectionOptions.cs:15`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:573-590` |
|
||||
|
||||
**Description**
|
||||
@@ -229,7 +267,19 @@ the initial-value seed and to `WriteBatchAndWaitAsync` paths if they are reachab
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16. `HandleWrite` now creates a `CancellationTokenSource(_options.WriteTimeout)`,
|
||||
passes its token to `_adapter.WriteAsync(...)`, and disposes the source in the
|
||||
continuation. A cancelled/timed-out write (`Task.IsCanceled` or a base
|
||||
`OperationCanceledException`) is translated into a failed `WriteTagResponse` with a
|
||||
`"Write timeout after Ns"` message, so a hung device write is bounded and the failure
|
||||
is returned synchronously to the calling script (WP-11) instead of blocking until the
|
||||
script's own Ask-timeout. (The `WriteBatchAndWaitAsync` adapter path already accepts
|
||||
an explicit `timeout`/`CancellationToken` and is not invoked by `HandleWrite`, so no
|
||||
change was needed there.) Regression test
|
||||
`DCL005_Write_ThatHangs_TimesOutAndReturnsFailureSynchronously` uses an adapter whose
|
||||
`WriteAsync` only completes when its token fires; it fails against the pre-fix
|
||||
unbounded code and passes after. Fixed by the commit whose message references
|
||||
`DataConnectionLayer-005` (commit `<pending>`).
|
||||
|
||||
### DataConnectionLayer-006 — Health quality counters not reset/recomputed after failover or re-subscribe
|
||||
|
||||
|
||||
Reference in New Issue
Block a user