fix(data-connection-layer): resolve DataConnectionLayer-002/003/004/005 — Resume supervision, concurrent dicts, subscribe-failure classification, write timeout
This commit is contained in:
@@ -8,7 +8,7 @@
|
|||||||
| Last reviewed | 2026-05-16 |
|
| Last reviewed | 2026-05-16 |
|
||||||
| Reviewer | claude-agent |
|
| Reviewer | claude-agent |
|
||||||
| Commit reviewed | `9c60592` |
|
| Commit reviewed | `9c60592` |
|
||||||
| Open findings | 12 |
|
| Open findings | 8 |
|
||||||
|
|
||||||
## Summary
|
## Summary
|
||||||
|
|
||||||
@@ -101,7 +101,7 @@ whose message references `DataConnectionLayer-001`.
|
|||||||
|--|--|
|
|--|--|
|
||||||
| Severity | High |
|
| Severity | High |
|
||||||
| Category | Akka.NET conventions |
|
| Category | Akka.NET conventions |
|
||||||
| Status | Open |
|
| Status | Resolved |
|
||||||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionManagerActor.cs:131-141` |
|
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionManagerActor.cs:131-141` |
|
||||||
|
|
||||||
**Description**
|
**Description**
|
||||||
@@ -127,7 +127,20 @@ after a crash and surface the lost-state condition rather than failing silently.
|
|||||||
|
|
||||||
**Resolution**
|
**Resolution**
|
||||||
|
|
||||||
_Unresolved._
|
Resolved 2026-05-16. The `DataConnectionManagerActor.SupervisorStrategy` was changed
|
||||||
|
from `Directive.Restart` to `Directive.Resume` for `DataConnectionActor` failures.
|
||||||
|
`Resume` keeps the existing actor instance and all its in-memory subscription state
|
||||||
|
(`_subscriptionsByInstance`, `_subscriptionIds`, `_subscribers`, quality counters)
|
||||||
|
intact across a transient handler exception, so the design doc's "transparent
|
||||||
|
re-subscribe" guarantee (WP-10) is preserved. The actor is a long-lived stateful
|
||||||
|
coordinator and its own Become/Stash reconnect state machine already recovers
|
||||||
|
connection-level faults — it does not need a restart. This also aligns with the
|
||||||
|
ScadaLink convention of `Resume` for coordinator actors. Regression test
|
||||||
|
`DCL002_ConnectionActorCrash_PreservesSubscriptionState` crashes the connection actor
|
||||||
|
via a synchronously-throwing write and asserts the subscription survives (health
|
||||||
|
report still shows 1 subscribed/resolved tag); it fails against the pre-fix `Restart`
|
||||||
|
code and passes after. Fixed by the commit whose message references
|
||||||
|
`DataConnectionLayer-002` (commit `<pending>`).
|
||||||
|
|
||||||
### DataConnectionLayer-003 — `RealOpcUaClient` callback/monitored-item dictionaries mutated without synchronization
|
### DataConnectionLayer-003 — `RealOpcUaClient` callback/monitored-item dictionaries mutated without synchronization
|
||||||
|
|
||||||
@@ -135,7 +148,7 @@ _Unresolved._
|
|||||||
|--|--|
|
|--|--|
|
||||||
| Severity | High |
|
| Severity | High |
|
||||||
| Category | Concurrency & thread safety |
|
| Category | Concurrency & thread safety |
|
||||||
| Status | Open |
|
| Status | Resolved |
|
||||||
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:16-17,130-131,153,163,173,183-184` |
|
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:16-17,130-131,153,163,173,183-184` |
|
||||||
|
|
||||||
**Description**
|
**Description**
|
||||||
@@ -162,7 +175,18 @@ threads still read `_callbacks` concurrently with `RemoveSubscriptionAsync` /
|
|||||||
|
|
||||||
**Resolution**
|
**Resolution**
|
||||||
|
|
||||||
_Unresolved._
|
Resolved 2026-05-16. `_monitoredItems` and `_callbacks` in `RealOpcUaClient` were
|
||||||
|
changed from plain `Dictionary<,>` to `ConcurrentDictionary<,>`, and the two
|
||||||
|
`Remove(key)` call sites switched to `TryRemove`. This makes the maps safe to read
|
||||||
|
from the OPC Foundation SDK's publish threads (`MonitoredItem.Notification` reading
|
||||||
|
`_callbacks`) concurrently with subscribe/disconnect mutations on other threads.
|
||||||
|
`RealOpcUaClient` wraps concrete OPC Foundation SDK types (`ISession`,
|
||||||
|
`Subscription`, `MonitoredItem`) and cannot be exercised without a live OPC UA
|
||||||
|
server, so the regression is guarded structurally by
|
||||||
|
`DCL003_SharedDictionaryFields_AreConcurrentCollections` (a reflection test asserting
|
||||||
|
both fields are `ConcurrentDictionary<,>`); it fails against the pre-fix `Dictionary`
|
||||||
|
code and passes after. Fixed by the commit whose message references
|
||||||
|
`DataConnectionLayer-003` (commit `<pending>`).
|
||||||
|
|
||||||
### DataConnectionLayer-004 — Subscribe-time tag-resolution failure leaves the connection healthy but never recovers correctly
|
### DataConnectionLayer-004 — Subscribe-time tag-resolution failure leaves the connection healthy but never recovers correctly
|
||||||
|
|
||||||
@@ -170,7 +194,7 @@ _Unresolved._
|
|||||||
|--|--|
|
|--|--|
|
||||||
| Severity | High |
|
| Severity | High |
|
||||||
| Category | Error handling & resilience |
|
| Category | Error handling & resilience |
|
||||||
| Status | Open |
|
| Status | Resolved |
|
||||||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:495-503,529-537` |
|
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:495-503,529-537` |
|
||||||
|
|
||||||
**Description**
|
**Description**
|
||||||
@@ -197,7 +221,21 @@ Instance Actor so it reflects the documented behaviour.
|
|||||||
|
|
||||||
**Resolution**
|
**Resolution**
|
||||||
|
|
||||||
_Unresolved._
|
Resolved 2026-05-16. The subscribe background task now classifies each subscribe
|
||||||
|
exception via the new `IsConnectionLevelFailure` helper (`InvalidOperationException`
|
||||||
|
— thrown by `EnsureConnected()` — plus `SocketException`/`TimeoutException`/
|
||||||
|
`IOException` count as connection-level; anything else is a genuine resolution
|
||||||
|
failure). The classification is carried on `SubscribeTagResult.ConnectionLevelFailure`
|
||||||
|
and applied on the actor thread in `HandleSubscribeCompleted`: connection-level
|
||||||
|
failures no longer become unresolved tags and instead drive the reconnection state
|
||||||
|
machine (`HandleSubscribeCompleted` returns a flag and the Connected-state handler
|
||||||
|
calls `BecomeReconnecting`); genuine resolution failures still go to `_unresolvedTags`
|
||||||
|
and the retry timer, and now also push a `TagValueUpdate` with `QualityCode.Bad` to
|
||||||
|
the subscribing Instance Actor, matching the design doc's Tag Path Resolution step 2.
|
||||||
|
Regression tests `DCL004_GenuineTagResolutionFailure_PushesBadQualityToSubscriber`
|
||||||
|
and `DCL004_ConnectionLevelSubscribeFailure_TriggersReconnect_NotTagRetry` both fail
|
||||||
|
against the pre-fix code and pass after. Fixed by the commit whose message references
|
||||||
|
`DataConnectionLayer-004` (commit `<pending>`).
|
||||||
|
|
||||||
### DataConnectionLayer-005 — `WriteTimeout` option is documented and configured but never applied
|
### DataConnectionLayer-005 — `WriteTimeout` option is documented and configured but never applied
|
||||||
|
|
||||||
@@ -205,7 +243,7 @@ _Unresolved._
|
|||||||
|--|--|
|
|--|--|
|
||||||
| Severity | High |
|
| Severity | High |
|
||||||
| Category | Design-document adherence |
|
| Category | Design-document adherence |
|
||||||
| Status | Open |
|
| Status | Resolved |
|
||||||
| Location | `src/ScadaLink.DataConnectionLayer/DataConnectionOptions.cs:15`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:573-590` |
|
| Location | `src/ScadaLink.DataConnectionLayer/DataConnectionOptions.cs:15`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:573-590` |
|
||||||
|
|
||||||
**Description**
|
**Description**
|
||||||
@@ -229,7 +267,19 @@ the initial-value seed and to `WriteBatchAndWaitAsync` paths if they are reachab
|
|||||||
|
|
||||||
**Resolution**
|
**Resolution**
|
||||||
|
|
||||||
_Unresolved._
|
Resolved 2026-05-16. `HandleWrite` now creates a `CancellationTokenSource(_options.WriteTimeout)`,
|
||||||
|
passes its token to `_adapter.WriteAsync(...)`, and disposes the source in the
|
||||||
|
continuation. A cancelled/timed-out write (`Task.IsCanceled` or a base
|
||||||
|
`OperationCanceledException`) is translated into a failed `WriteTagResponse` with a
|
||||||
|
`"Write timeout after Ns"` message, so a hung device write is bounded and the failure
|
||||||
|
is returned synchronously to the calling script (WP-11) instead of blocking until the
|
||||||
|
script's own Ask-timeout. (The `WriteBatchAndWaitAsync` adapter path already accepts
|
||||||
|
an explicit `timeout`/`CancellationToken` and is not invoked by `HandleWrite`, so no
|
||||||
|
change was needed there.) Regression test
|
||||||
|
`DCL005_Write_ThatHangs_TimesOutAndReturnsFailureSynchronously` uses an adapter whose
|
||||||
|
`WriteAsync` only completes when its token fires; it fails against the pre-fix
|
||||||
|
unbounded code and passes after. Fixed by the commit whose message references
|
||||||
|
`DataConnectionLayer-005` (commit `<pending>`).
|
||||||
|
|
||||||
### DataConnectionLayer-006 — Health quality counters not reset/recomputed after failover or re-subscribe
|
### DataConnectionLayer-006 — Health quality counters not reset/recomputed after failover or re-subscribe
|
||||||
|
|
||||||
|
|||||||
@@ -213,7 +213,13 @@ public class DataConnectionActor : UntypedActor, IWithStash, IWithTimers
|
|||||||
HandleSubscribe(req);
|
HandleSubscribe(req);
|
||||||
break;
|
break;
|
||||||
case SubscribeCompleted sc:
|
case SubscribeCompleted sc:
|
||||||
HandleSubscribeCompleted(sc);
|
// In Connected state, a connection-level subscribe failure must drive
|
||||||
|
// the reconnection state machine (DataConnectionLayer-004).
|
||||||
|
if (HandleSubscribeCompleted(sc))
|
||||||
|
{
|
||||||
|
_log.Warning("[{0}] Connection-level subscribe failure — entering Reconnecting", _connectionName);
|
||||||
|
BecomeReconnecting();
|
||||||
|
}
|
||||||
break;
|
break;
|
||||||
case UnsubscribeTagsRequest req:
|
case UnsubscribeTagsRequest req:
|
||||||
HandleUnsubscribe(req);
|
HandleUnsubscribe(req);
|
||||||
@@ -514,8 +520,14 @@ public class DataConnectionActor : UntypedActor, IWithStash, IWithTimers
|
|||||||
}
|
}
|
||||||
catch (Exception ex)
|
catch (Exception ex)
|
||||||
{
|
{
|
||||||
// WP-12: Tag path resolution failure — reported back as unresolved.
|
// DataConnectionLayer-004: distinguish a connection-level fault
|
||||||
results.Add(new SubscribeTagResult(tagPath, AlreadySubscribed: false, Success: false, null, ex.Message));
|
// (adapter not connected / transport down) from a genuine
|
||||||
|
// node-not-found. Connection-level faults must drive the
|
||||||
|
// reconnection state machine, not be retried as unresolved tags.
|
||||||
|
var connectionLevel = IsConnectionLevelFailure(ex);
|
||||||
|
results.Add(new SubscribeTagResult(
|
||||||
|
tagPath, AlreadySubscribed: false, Success: false, null, ex.Message,
|
||||||
|
ConnectionLevelFailure: connectionLevel));
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -546,8 +558,10 @@ public class DataConnectionActor : UntypedActor, IWithStash, IWithTimers
|
|||||||
/// Applies the result of an asynchronous subscribe on the actor thread. ALL mutation
|
/// Applies the result of an asynchronous subscribe on the actor thread. ALL mutation
|
||||||
/// of subscription state and counters happens here — never on the background task —
|
/// of subscription state and counters happens here — never on the background task —
|
||||||
/// so the actor model's single-threaded state guarantee holds.
|
/// so the actor model's single-threaded state guarantee holds.
|
||||||
|
/// Returns <c>true</c> if any tag failed at connection level (DataConnectionLayer-004),
|
||||||
|
/// signalling the caller (only the Connected state) to enter Reconnecting.
|
||||||
/// </summary>
|
/// </summary>
|
||||||
private void HandleSubscribeCompleted(SubscribeCompleted msg)
|
private bool HandleSubscribeCompleted(SubscribeCompleted msg)
|
||||||
{
|
{
|
||||||
var instanceName = msg.Request.InstanceUniqueName;
|
var instanceName = msg.Request.InstanceUniqueName;
|
||||||
if (!_subscriptionsByInstance.TryGetValue(instanceName, out var instanceTags))
|
if (!_subscriptionsByInstance.TryGetValue(instanceName, out var instanceTags))
|
||||||
@@ -557,6 +571,12 @@ public class DataConnectionActor : UntypedActor, IWithStash, IWithTimers
|
|||||||
_subscriptionsByInstance[instanceName] = instanceTags;
|
_subscriptionsByInstance[instanceName] = instanceTags;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// DataConnectionLayer-004: if any tag failed because the adapter is not
|
||||||
|
// connected (a connection-level fault), the subscribe needs the reconnection
|
||||||
|
// state machine, not the tag-resolution retry. Drive a disconnect and let the
|
||||||
|
// request be re-stashed/retried after reconnect via ReSubscribeAll.
|
||||||
|
var connectionLevelFailure = msg.Results.Any(r => !r.Success && r.ConnectionLevelFailure);
|
||||||
|
|
||||||
foreach (var result in msg.Results)
|
foreach (var result in msg.Results)
|
||||||
{
|
{
|
||||||
instanceTags.Add(result.TagPath);
|
instanceTags.Add(result.TagPath);
|
||||||
@@ -572,13 +592,31 @@ public class DataConnectionActor : UntypedActor, IWithStash, IWithTimers
|
|||||||
_totalSubscribed++;
|
_totalSubscribed++;
|
||||||
_resolvedTags++;
|
_resolvedTags++;
|
||||||
}
|
}
|
||||||
|
else if (result.ConnectionLevelFailure)
|
||||||
|
{
|
||||||
|
// Connection-level fault — do not count as an unresolved tag.
|
||||||
|
// ReSubscribeAll after reconnect derives the tag from
|
||||||
|
// _subscriptionsByInstance (already updated above).
|
||||||
|
_log.Warning("[{0}] Subscribe for {1} failed at connection level: {2}",
|
||||||
|
_connectionName, result.TagPath, result.Error);
|
||||||
|
}
|
||||||
else
|
else
|
||||||
{
|
{
|
||||||
// WP-12: mark unresolved so the periodic retry timer picks it up.
|
// WP-12: genuine tag resolution failure — mark unresolved so the
|
||||||
|
// periodic retry timer picks it up.
|
||||||
_unresolvedTags.Add(result.TagPath);
|
_unresolvedTags.Add(result.TagPath);
|
||||||
_totalSubscribed++;
|
_totalSubscribed++;
|
||||||
_log.Debug("[{0}] Tag resolution failed for {1}: {2}",
|
_log.Debug("[{0}] Tag resolution failed for {1}: {2}",
|
||||||
_connectionName, result.TagPath, result.Error);
|
_connectionName, result.TagPath, result.Error);
|
||||||
|
|
||||||
|
// DataConnectionLayer-004 / design doc Tag Path Resolution step 2:
|
||||||
|
// mark the attribute quality `bad` so the Instance Actor sees a
|
||||||
|
// signal rather than staying Uncertain indefinitely.
|
||||||
|
if (_subscribers.TryGetValue(instanceName, out var subscriber))
|
||||||
|
{
|
||||||
|
subscriber.Tell(new TagValueUpdate(
|
||||||
|
_connectionName, result.TagPath, null, QualityCode.Bad, DateTimeOffset.UtcNow));
|
||||||
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -594,6 +632,27 @@ public class DataConnectionActor : UntypedActor, IWithStash, IWithTimers
|
|||||||
|
|
||||||
msg.ReplyTo.Tell(new SubscribeTagsResponse(
|
msg.ReplyTo.Tell(new SubscribeTagsResponse(
|
||||||
msg.Request.CorrelationId, instanceName, true, null, DateTimeOffset.UtcNow));
|
msg.Request.CorrelationId, instanceName, true, null, DateTimeOffset.UtcNow));
|
||||||
|
|
||||||
|
// The caller (Connected state only) decides whether to enter Reconnecting.
|
||||||
|
// In Connecting/Reconnecting the connection is not established anyway, so the
|
||||||
|
// existing reconnect cycle handles recovery without a re-trigger here.
|
||||||
|
return connectionLevelFailure;
|
||||||
|
}
|
||||||
|
|
||||||
|
/// <summary>
|
||||||
|
/// DataConnectionLayer-004: classifies a subscribe exception as a connection-level
|
||||||
|
/// fault (adapter not connected / transport down) versus a genuine tag-resolution
|
||||||
|
/// failure (the node does not exist on the device). Connection-level faults must
|
||||||
|
/// drive the reconnection state machine; resolution failures are retried on the
|
||||||
|
/// tag-resolution timer.
|
||||||
|
/// </summary>
|
||||||
|
private static bool IsConnectionLevelFailure(Exception ex)
|
||||||
|
{
|
||||||
|
var baseEx = ex is AggregateException agg ? agg.GetBaseException() : ex;
|
||||||
|
return baseEx is InvalidOperationException
|
||||||
|
or System.Net.Sockets.SocketException
|
||||||
|
or TimeoutException
|
||||||
|
or System.IO.IOException;
|
||||||
}
|
}
|
||||||
|
|
||||||
private void HandleUnsubscribe(UnsubscribeTagsRequest request)
|
private void HandleUnsubscribe(UnsubscribeTagsRequest request)
|
||||||
@@ -634,15 +693,29 @@ public class DataConnectionActor : UntypedActor, IWithStash, IWithTimers
|
|||||||
_log.Debug("[{0}] Writing to tag {1}", _connectionName, request.TagPath);
|
_log.Debug("[{0}] Writing to tag {1}", _connectionName, request.TagPath);
|
||||||
var sender = Sender;
|
var sender = Sender;
|
||||||
|
|
||||||
|
// DataConnectionLayer-005: bound the write with WriteTimeout. A hung device
|
||||||
|
// write (TCP black-hole) would otherwise never complete, so PipeTo never
|
||||||
|
// fires and the calling script gets no DCL-level error. The CancellationToken
|
||||||
|
// is passed to the adapter; on timeout we translate cancellation into a
|
||||||
|
// failed WriteTagResponse so the failure is returned synchronously (WP-11).
|
||||||
|
var cts = new CancellationTokenSource(_options.WriteTimeout);
|
||||||
|
|
||||||
// WP-11: Write through DCL to device, failure returned synchronously
|
// WP-11: Write through DCL to device, failure returned synchronously
|
||||||
_adapter.WriteAsync(request.TagPath, request.Value).ContinueWith(t =>
|
_adapter.WriteAsync(request.TagPath, request.Value, cts.Token).ContinueWith(t =>
|
||||||
{
|
{
|
||||||
|
cts.Dispose();
|
||||||
if (t.IsCompletedSuccessfully)
|
if (t.IsCompletedSuccessfully)
|
||||||
{
|
{
|
||||||
var result = t.Result;
|
var result = t.Result;
|
||||||
return new WriteTagResponse(
|
return new WriteTagResponse(
|
||||||
request.CorrelationId, result.Success, result.ErrorMessage, DateTimeOffset.UtcNow);
|
request.CorrelationId, result.Success, result.ErrorMessage, DateTimeOffset.UtcNow);
|
||||||
}
|
}
|
||||||
|
if (t.IsCanceled || t.Exception?.GetBaseException() is OperationCanceledException)
|
||||||
|
{
|
||||||
|
return new WriteTagResponse(
|
||||||
|
request.CorrelationId, false,
|
||||||
|
$"Write timeout after {_options.WriteTimeout.TotalSeconds:F0}s", DateTimeOffset.UtcNow);
|
||||||
|
}
|
||||||
return new WriteTagResponse(
|
return new WriteTagResponse(
|
||||||
request.CorrelationId, false, t.Exception?.GetBaseException().Message, DateTimeOffset.UtcNow);
|
request.CorrelationId, false, t.Exception?.GetBaseException().Message, DateTimeOffset.UtcNow);
|
||||||
}).PipeTo(sender);
|
}).PipeTo(sender);
|
||||||
@@ -824,7 +897,8 @@ public class DataConnectionActor : UntypedActor, IWithStash, IWithTimers
|
|||||||
internal record TagResolutionSucceeded(string TagPath, string SubscriptionId);
|
internal record TagResolutionSucceeded(string TagPath, string SubscriptionId);
|
||||||
internal record RetryTagResolution;
|
internal record RetryTagResolution;
|
||||||
internal record SubscribeTagResult(
|
internal record SubscribeTagResult(
|
||||||
string TagPath, bool AlreadySubscribed, bool Success, string? SubscriptionId, string? Error);
|
string TagPath, bool AlreadySubscribed, bool Success, string? SubscriptionId, string? Error,
|
||||||
|
bool ConnectionLevelFailure = false);
|
||||||
internal record SubscribeCompleted(
|
internal record SubscribeCompleted(
|
||||||
SubscribeTagsRequest Request, IActorRef ReplyTo, IReadOnlyList<SubscribeTagResult> Results);
|
SubscribeTagsRequest Request, IActorRef ReplyTo, IReadOnlyList<SubscribeTagResult> Results);
|
||||||
public record GetHealthReport;
|
public record GetHealthReport;
|
||||||
|
|||||||
@@ -125,8 +125,20 @@ public class DataConnectionManagerActor : ReceiveActor
|
|||||||
}
|
}
|
||||||
|
|
||||||
/// <summary>
|
/// <summary>
|
||||||
/// OneForOneStrategy with Restart for connection actors — a failed connection
|
/// OneForOneStrategy with Resume for connection actors.
|
||||||
/// should restart and attempt reconnection.
|
///
|
||||||
|
/// DataConnectionLayer-002: a DataConnectionActor is a long-lived, stateful
|
||||||
|
/// coordinator — its in-memory subscription registry (_subscriptionsByInstance,
|
||||||
|
/// _subscriptionIds, _subscribers) is the only record of which Instance Actors
|
||||||
|
/// subscribed to which tags, and there is no durable store to rebuild it from.
|
||||||
|
/// Restart would create a fresh instance and silently discard that registry,
|
||||||
|
/// breaking the design doc's "transparent re-subscribe" guarantee (WP-10):
|
||||||
|
/// subscribers would never be re-subscribed and would sit at stale quality with
|
||||||
|
/// no error. Resume keeps the actor instance and its state intact, so a transient
|
||||||
|
/// exception in a message handler does not lose subscription state. The actor's
|
||||||
|
/// own Become/Stash reconnect state machine already recovers connection-level
|
||||||
|
/// faults, so it does not need a restart to re-establish the connection.
|
||||||
|
/// This matches the ScadaLink convention of Resume for coordinator actors.
|
||||||
/// </summary>
|
/// </summary>
|
||||||
protected override SupervisorStrategy SupervisorStrategy()
|
protected override SupervisorStrategy SupervisorStrategy()
|
||||||
{
|
{
|
||||||
@@ -135,8 +147,8 @@ public class DataConnectionManagerActor : ReceiveActor
|
|||||||
withinTimeRange: TimeSpan.FromMinutes(1),
|
withinTimeRange: TimeSpan.FromMinutes(1),
|
||||||
decider: Decider.From(ex =>
|
decider: Decider.From(ex =>
|
||||||
{
|
{
|
||||||
_log.Warning(ex, "DataConnectionActor threw exception, restarting");
|
_log.Warning(ex, "DataConnectionActor threw exception, resuming (subscription state preserved)");
|
||||||
return Directive.Restart;
|
return Directive.Resume;
|
||||||
}));
|
}));
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -1,3 +1,4 @@
|
|||||||
|
using System.Collections.Concurrent;
|
||||||
using System.Security.Cryptography.X509Certificates;
|
using System.Security.Cryptography.X509Certificates;
|
||||||
using Opc.Ua;
|
using Opc.Ua;
|
||||||
using Opc.Ua.Client;
|
using Opc.Ua.Client;
|
||||||
@@ -13,8 +14,14 @@ public class RealOpcUaClient : IOpcUaClient
|
|||||||
{
|
{
|
||||||
private ISession? _session;
|
private ISession? _session;
|
||||||
private Subscription? _subscription;
|
private Subscription? _subscription;
|
||||||
private readonly Dictionary<string, MonitoredItem> _monitoredItems = new();
|
|
||||||
private readonly Dictionary<string, Action<string, object?, DateTime, uint>> _callbacks = new();
|
// DataConnectionLayer-003: these maps are read from the OPC Foundation SDK's
|
||||||
|
// internal publish threads (the MonitoredItem.Notification handler reads
|
||||||
|
// _callbacks) concurrently with subscribe/disconnect mutations that run on
|
||||||
|
// thread-pool threads. Plain Dictionary access during a concurrent resize or
|
||||||
|
// Clear() is undefined behaviour, so they must be ConcurrentDictionary.
|
||||||
|
private readonly ConcurrentDictionary<string, MonitoredItem> _monitoredItems = new();
|
||||||
|
private readonly ConcurrentDictionary<string, Action<string, object?, DateTime, uint>> _callbacks = new();
|
||||||
private volatile bool _connectionLostFired;
|
private volatile bool _connectionLostFired;
|
||||||
private OpcUaConnectionOptions _options = new();
|
private OpcUaConnectionOptions _options = new();
|
||||||
private readonly OpcUaGlobalOptions _globalOptions;
|
private readonly OpcUaGlobalOptions _globalOptions;
|
||||||
@@ -180,8 +187,8 @@ public class RealOpcUaClient : IOpcUaClient
|
|||||||
{
|
{
|
||||||
_subscription.RemoveItem(item);
|
_subscription.RemoveItem(item);
|
||||||
await _subscription.ApplyChangesAsync(cancellationToken);
|
await _subscription.ApplyChangesAsync(cancellationToken);
|
||||||
_monitoredItems.Remove(subscriptionHandle);
|
_monitoredItems.TryRemove(subscriptionHandle, out _);
|
||||||
_callbacks.Remove(subscriptionHandle);
|
_callbacks.TryRemove(subscriptionHandle, out _);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
@@ -512,6 +512,106 @@ public class DataConnectionActorTests : TestKit
|
|||||||
Assert.Equal(instances * tagsPerInstance, report.ResolvedTags);
|
Assert.Equal(instances * tagsPerInstance, report.ResolvedTags);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ── DataConnectionLayer-004: subscribe-time failure classification ──
|
||||||
|
|
||||||
|
[Fact]
|
||||||
|
public async Task DCL004_GenuineTagResolutionFailure_PushesBadQualityToSubscriber()
|
||||||
|
{
|
||||||
|
// Regression test for DataConnectionLayer-004. When a tag genuinely fails to
|
||||||
|
// resolve at subscribe time, the design doc (Tag Path Resolution, step 2)
|
||||||
|
// requires the attribute to be marked quality `bad`. The pre-fix code only
|
||||||
|
// logged and added the tag to _unresolvedTags — the Instance Actor never got
|
||||||
|
// a signal. After the fix, a bad-quality TagValueUpdate is pushed.
|
||||||
|
_mockAdapter.ConnectAsync(Arg.Any<IDictionary<string, string>>(), Arg.Any<CancellationToken>())
|
||||||
|
.Returns(Task.CompletedTask);
|
||||||
|
_mockAdapter.Status.Returns(ConnectionHealth.Connected);
|
||||||
|
// Genuine node-not-found: a non-connection exception.
|
||||||
|
_mockAdapter.SubscribeAsync("missing/tag", Arg.Any<SubscriptionCallback>(), Arg.Any<CancellationToken>())
|
||||||
|
.Returns(Task.FromException<string>(new KeyNotFoundException("node not found")));
|
||||||
|
_mockAdapter.ReadAsync(Arg.Any<string>(), Arg.Any<CancellationToken>())
|
||||||
|
.Returns(new ReadResult(false, null, null));
|
||||||
|
|
||||||
|
var actor = CreateConnectionActor("dcl004-bad-quality");
|
||||||
|
await Task.Delay(300);
|
||||||
|
|
||||||
|
actor.Tell(new SubscribeTagsRequest(
|
||||||
|
"c1", "inst1", "dcl004-bad-quality", ["missing/tag"], DateTimeOffset.UtcNow));
|
||||||
|
|
||||||
|
// Two messages arrive: the subscribe ack and a bad-quality update for the tag.
|
||||||
|
var bad = ExpectMsg<TagValueUpdate>(TimeSpan.FromSeconds(5));
|
||||||
|
Assert.Equal("missing/tag", bad.TagPath);
|
||||||
|
Assert.Equal(QualityCode.Bad, bad.Quality);
|
||||||
|
|
||||||
|
var ack = ExpectMsg<SubscribeTagsResponse>(TimeSpan.FromSeconds(5));
|
||||||
|
Assert.True(ack.Success);
|
||||||
|
}
|
||||||
|
|
||||||
|
[Fact]
|
||||||
|
public async Task DCL004_ConnectionLevelSubscribeFailure_TriggersReconnect_NotTagRetry()
|
||||||
|
{
|
||||||
|
// Regression test for DataConnectionLayer-004. A subscribe failing because the
|
||||||
|
// adapter is not connected (InvalidOperationException from EnsureConnected) is
|
||||||
|
// a connection problem, not a bad tag path. The pre-fix code misclassified it
|
||||||
|
// as an unresolved tag and retried it on the 10s tag-resolution timer. After
|
||||||
|
// the fix it drives the reconnection state machine instead.
|
||||||
|
_mockAdapter.ConnectAsync(Arg.Any<IDictionary<string, string>>(), Arg.Any<CancellationToken>())
|
||||||
|
.Returns(Task.CompletedTask);
|
||||||
|
_mockAdapter.Status.Returns(ConnectionHealth.Connected);
|
||||||
|
_mockAdapter.SubscribeAsync(Arg.Any<string>(), Arg.Any<SubscriptionCallback>(), Arg.Any<CancellationToken>())
|
||||||
|
.Returns(Task.FromException<string>(
|
||||||
|
new InvalidOperationException("OPC UA client is not connected.")));
|
||||||
|
_mockAdapter.ReadAsync(Arg.Any<string>(), Arg.Any<CancellationToken>())
|
||||||
|
.Returns(new ReadResult(false, null, null));
|
||||||
|
|
||||||
|
var actor = CreateConnectionActor("dcl004-conn-level");
|
||||||
|
await Task.Delay(300);
|
||||||
|
|
||||||
|
actor.Tell(new SubscribeTagsRequest(
|
||||||
|
"c1", "inst1", "dcl004-conn-level", ["some/tag"], DateTimeOffset.UtcNow));
|
||||||
|
|
||||||
|
// The connection-level failure must drive the actor into Reconnecting, which
|
||||||
|
// re-attempts ConnectAsync. Pre-fix the actor stayed Connected and only armed
|
||||||
|
// the tag-resolution timer, so ConnectAsync is called exactly once.
|
||||||
|
AwaitCondition(() =>
|
||||||
|
_mockAdapter.ReceivedCalls().Count(c => c.GetMethodInfo().Name == "ConnectAsync") >= 2,
|
||||||
|
TimeSpan.FromSeconds(5));
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── DataConnectionLayer-005: WriteTimeout must bound a hung write ──
|
||||||
|
|
||||||
|
[Fact]
|
||||||
|
public async Task DCL005_Write_ThatHangs_TimesOutAndReturnsFailureSynchronously()
|
||||||
|
{
|
||||||
|
// Regression test for DataConnectionLayer-005. HandleWrite called WriteAsync
|
||||||
|
// with no CancellationToken and no timeout, so a hung device write never
|
||||||
|
// produced a WriteTagResponse. The calling script would block until its own
|
||||||
|
// Ask-timeout with no DCL-level error. After the fix, _options.WriteTimeout
|
||||||
|
// bounds the write and a timeout is surfaced as a failed WriteTagResponse.
|
||||||
|
_options.WriteTimeout = TimeSpan.FromMilliseconds(300);
|
||||||
|
|
||||||
|
_mockAdapter.ConnectAsync(Arg.Any<IDictionary<string, string>>(), Arg.Any<CancellationToken>())
|
||||||
|
.Returns(Task.CompletedTask);
|
||||||
|
|
||||||
|
// WriteAsync never completes unless its cancellation token fires.
|
||||||
|
_mockAdapter.WriteAsync("tag1", 42, Arg.Any<CancellationToken>())
|
||||||
|
.Returns(ci =>
|
||||||
|
{
|
||||||
|
var ct = ci.Arg<CancellationToken>();
|
||||||
|
var tcs = new TaskCompletionSource<WriteResult>();
|
||||||
|
ct.Register(() => tcs.TrySetCanceled(ct));
|
||||||
|
return tcs.Task;
|
||||||
|
});
|
||||||
|
|
||||||
|
var actor = CreateConnectionActor("dcl005-write-timeout");
|
||||||
|
await Task.Delay(300); // reach Connected state
|
||||||
|
|
||||||
|
actor.Tell(new WriteTagRequest("corr1", "dcl005-write-timeout", "tag1", 42, DateTimeOffset.UtcNow));
|
||||||
|
|
||||||
|
var response = ExpectMsg<WriteTagResponse>(TimeSpan.FromSeconds(3));
|
||||||
|
Assert.False(response.Success);
|
||||||
|
Assert.Contains("timeout", response.ErrorMessage, StringComparison.OrdinalIgnoreCase);
|
||||||
|
}
|
||||||
|
|
||||||
[Fact]
|
[Fact]
|
||||||
public async Task DCL001_SubscribeWithFailedTags_CountsResolvedAndUnresolvedSeparately()
|
public async Task DCL001_SubscribeWithFailedTags_CountsResolvedAndUnresolvedSeparately()
|
||||||
{
|
{
|
||||||
@@ -533,7 +633,11 @@ public class DataConnectionActorTests : TestKit
|
|||||||
actor.Tell(new SubscribeTagsRequest(
|
actor.Tell(new SubscribeTagsRequest(
|
||||||
"c1", "inst1", "dcl001-failed-tags",
|
"c1", "inst1", "dcl001-failed-tags",
|
||||||
["good/a", "good/b", "good/c", "bad/x", "bad/y"], DateTimeOffset.UtcNow));
|
["good/a", "good/b", "good/c", "bad/x", "bad/y"], DateTimeOffset.UtcNow));
|
||||||
ExpectMsg<SubscribeTagsResponse>(TimeSpan.FromSeconds(5));
|
|
||||||
|
// Two genuine resolution failures now also push a bad-quality TagValueUpdate
|
||||||
|
// to the subscriber (DataConnectionLayer-004); skip past those to the ack.
|
||||||
|
var ack = FishForMessage<SubscribeTagsResponse>(_ => true, TimeSpan.FromSeconds(5));
|
||||||
|
Assert.True(ack.Success);
|
||||||
|
|
||||||
actor.Tell(new DataConnectionActor.GetHealthReport());
|
actor.Tell(new DataConnectionActor.GetHealthReport());
|
||||||
var report = ExpectMsg<DataConnectionHealthReport>(TimeSpan.FromSeconds(3));
|
var report = ExpectMsg<DataConnectionHealthReport>(TimeSpan.FromSeconds(3));
|
||||||
|
|||||||
@@ -3,6 +3,7 @@ using Akka.TestKit.Xunit2;
|
|||||||
using NSubstitute;
|
using NSubstitute;
|
||||||
using ScadaLink.Commons.Interfaces.Protocol;
|
using ScadaLink.Commons.Interfaces.Protocol;
|
||||||
using ScadaLink.Commons.Messages.DataConnection;
|
using ScadaLink.Commons.Messages.DataConnection;
|
||||||
|
using ScadaLink.Commons.Types.Enums;
|
||||||
using ScadaLink.DataConnectionLayer.Actors;
|
using ScadaLink.DataConnectionLayer.Actors;
|
||||||
using ScadaLink.HealthMonitoring;
|
using ScadaLink.HealthMonitoring;
|
||||||
|
|
||||||
@@ -57,6 +58,52 @@ public class DataConnectionManagerActorTests : TestKit
|
|||||||
Assert.Contains("Unknown connection", response.ErrorMessage);
|
Assert.Contains("Unknown connection", response.ErrorMessage);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
[Fact]
|
||||||
|
public async Task DCL002_ConnectionActorCrash_PreservesSubscriptionState()
|
||||||
|
{
|
||||||
|
// Regression test for DataConnectionLayer-002. The supervisor used
|
||||||
|
// Directive.Restart, which discards the connection actor's in-memory
|
||||||
|
// subscription registry — breaking the design doc's "transparent
|
||||||
|
// re-subscribe" guarantee (subscribers are never re-subscribed and sit at
|
||||||
|
// stale quality forever). After the fix the supervisor uses Resume, which
|
||||||
|
// keeps the actor instance and its state across a transient exception.
|
||||||
|
var mockAdapter = Substitute.For<IDataConnection>();
|
||||||
|
mockAdapter.ConnectAsync(Arg.Any<IDictionary<string, string>>(), Arg.Any<CancellationToken>())
|
||||||
|
.Returns(Task.CompletedTask);
|
||||||
|
mockAdapter.Status.Returns(ConnectionHealth.Connected);
|
||||||
|
mockAdapter.SubscribeAsync(Arg.Any<string>(), Arg.Any<SubscriptionCallback>(), Arg.Any<CancellationToken>())
|
||||||
|
.Returns("sub-001");
|
||||||
|
mockAdapter.ReadAsync(Arg.Any<string>(), Arg.Any<CancellationToken>())
|
||||||
|
.Returns(new ReadResult(false, null, null));
|
||||||
|
// A write throws synchronously, escaping the message handler and crashing
|
||||||
|
// the connection actor — exercising the supervisor strategy.
|
||||||
|
mockAdapter.WriteAsync(Arg.Any<string>(), Arg.Any<object?>(), Arg.Any<CancellationToken>())
|
||||||
|
.Returns<Task<WriteResult>>(_ => throw new InvalidOperationException("boom"));
|
||||||
|
|
||||||
|
_mockFactory.Create("OpcUa", Arg.Any<IDictionary<string, string>>()).Returns(mockAdapter);
|
||||||
|
|
||||||
|
var manager = Sys.ActorOf(Props.Create(() =>
|
||||||
|
new DataConnectionManagerActor(_mockFactory, _options, _mockHealthCollector)));
|
||||||
|
|
||||||
|
manager.Tell(new CreateConnectionCommand("conn1", "OpcUa", new Dictionary<string, string>(), null, 3));
|
||||||
|
await Task.Delay(300); // connection actor reaches Connected
|
||||||
|
|
||||||
|
// Register a subscription.
|
||||||
|
manager.Tell(new SubscribeTagsRequest("c1", "inst1", "conn1", ["tag1"], DateTimeOffset.UtcNow));
|
||||||
|
ExpectMsg<SubscribeTagsResponse>(TimeSpan.FromSeconds(3));
|
||||||
|
|
||||||
|
// Crash the connection actor via a synchronously-throwing write.
|
||||||
|
manager.Tell(new WriteTagRequest("c2", "conn1", "tag1", 42, DateTimeOffset.UtcNow));
|
||||||
|
await Task.Delay(300); // supervisor handles the failure
|
||||||
|
|
||||||
|
// After the crash the subscription state must survive: the health report
|
||||||
|
// still shows the subscribed/resolved tag. With Restart it would be 0.
|
||||||
|
manager.Tell(new GetAllHealthReports());
|
||||||
|
var report = ExpectMsg<DataConnectionHealthReport>(TimeSpan.FromSeconds(3));
|
||||||
|
Assert.Equal(1, report.TotalSubscribedTags);
|
||||||
|
Assert.Equal(1, report.ResolvedTags);
|
||||||
|
}
|
||||||
|
|
||||||
[Fact]
|
[Fact]
|
||||||
public void CreateConnection_UsesFactory()
|
public void CreateConnection_UsesFactory()
|
||||||
{
|
{
|
||||||
|
|||||||
@@ -6,6 +6,37 @@ using ScadaLink.DataConnectionLayer.Adapters;
|
|||||||
|
|
||||||
namespace ScadaLink.DataConnectionLayer.Tests;
|
namespace ScadaLink.DataConnectionLayer.Tests;
|
||||||
|
|
||||||
|
/// <summary>
|
||||||
|
/// DataConnectionLayer-003: structural regression guard. RealOpcUaClient's
|
||||||
|
/// monitored-item / callback maps are read from the OPC UA SDK's publish threads
|
||||||
|
/// concurrently with subscribe/disconnect mutations on other threads. They must be
|
||||||
|
/// concurrent collections, not plain Dictionary. This is verified structurally
|
||||||
|
/// because RealOpcUaClient wraps concrete OPC Foundation SDK types and cannot be
|
||||||
|
/// exercised without a live OPC UA server.
|
||||||
|
/// </summary>
|
||||||
|
public class RealOpcUaClientThreadSafetyTests
|
||||||
|
{
|
||||||
|
[Theory]
|
||||||
|
[InlineData("_callbacks")]
|
||||||
|
[InlineData("_monitoredItems")]
|
||||||
|
public void DCL003_SharedDictionaryFields_AreConcurrentCollections(string fieldName)
|
||||||
|
{
|
||||||
|
var field = typeof(RealOpcUaClient)
|
||||||
|
.GetField(fieldName,
|
||||||
|
System.Reflection.BindingFlags.Instance |
|
||||||
|
System.Reflection.BindingFlags.NonPublic);
|
||||||
|
|
||||||
|
Assert.NotNull(field);
|
||||||
|
|
||||||
|
var fieldType = field!.FieldType;
|
||||||
|
Assert.True(
|
||||||
|
fieldType.IsGenericType &&
|
||||||
|
fieldType.GetGenericTypeDefinition() == typeof(System.Collections.Concurrent.ConcurrentDictionary<,>),
|
||||||
|
$"RealOpcUaClient.{fieldName} must be a ConcurrentDictionary<,> for thread safety, " +
|
||||||
|
$"but was {fieldType.Name}.");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
/// <summary>
|
/// <summary>
|
||||||
/// WP-7: Tests for OPC UA adapter.
|
/// WP-7: Tests for OPC UA adapter.
|
||||||
/// </summary>
|
/// </summary>
|
||||||
|
|||||||
Reference in New Issue
Block a user