Solution + 23 src projects + 26 test projects renamed; folders, csproj, namespaces, and ScadaLinkDbContext/ScadaBridgeDbContext class updated. ActorSystem "scadalink" → "scadabridge", Akka seed-node URLs migrated. SQL roles/logins, LDAP domains, CLI command name, and CLI config dir (~/.scadalink → ~/.scadabridge) also renamed. Build green; 5 Host.Tests fail awaiting SQL login rename in next commit. Pre-existing StaleTagMonitor timing flakes unchanged. Rename script committed at tools/rename-to-scadabridge.sh.
73 KiB
Code Review — DataConnectionLayer
| Field | Value |
|---|---|
| Module | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer |
| Design doc | docs/requirements/Component-DataConnectionLayer.md |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | 1eb6e97 |
| Open findings | 5 |
Summary
The DataConnectionLayer is a reasonably well-structured module: the Become/Stash
lifecycle state machine, the captured-Self marshalling of background-thread
disconnect events, and the protocol-factory abstraction all follow the design doc
and Akka.NET conventions. However, the review found one critical actor-model
violation — HandleSubscribe spawns a Task.Run that mutates the actor's private
dictionaries and counters from a thread-pool thread, racing with the actor's own
message loop. Several high-severity issues cluster around concurrency and error
handling: the subscription-failure path leaves the connection with degraded subtrees
but no real recovery, the DataConnectionManagerActor's Restart supervision drops
all subscription state on a connection-actor crash, and RealOpcUaClient's monitored-
item callback dictionary is mutated without synchronization while OPC UA notification
threads read it. The remaining findings concern stale health counters after failover,
an unused WriteTimeout option (writes are unbounded despite the design promising a
30 s timeout), ReadBatchAsync aborting mid-batch, and documentation drift between
the design doc's failover state machine and the implemented unstable-disconnect
heuristic. Test coverage is adequate for the happy paths and failover but absent for
tag-resolution retry, disconnect/re-subscribe, and concurrency around HandleSubscribe.
Re-review 2026-05-28 (commit 1eb6e97)
The 2026-05-28 re-review walked all 10 checklist categories against the current
source and found 5 new findings. All 17 prior findings remain Resolved and the
fixes (reverse-index unsubscribe, atomic disconnect guards, real-logger threading,
initial-connect failover, per-tag write-batch results, subscribe-response accuracy)
were verified in place. The new findings cluster around HandleSubscribe /
HandleSubscribeCompleted race-induced state drift:
- High — concurrent subscribes for the same tag from different instances each see
the tag as not-yet-subscribed (the
alreadySubscribedsnapshot was taken before the Task.Run dispatch), so each Task.Run calls_adapter.SubscribeAsyncand the laterHandleSubscribeCompletedsilently discards the second adapter subscription handle — the orphan never getsUnsubscribeAsync'd. - Medium —
OpcUaDataConnection._subscriptionHandlesis a plainDictionary<,>mutated from thread-pool continuations ofSubscribeAsync/UnsubscribeAsync/DisconnectAsyncrunning in parallel — the same class of bug DCL-003 fixed inRealOpcUaClientbut missed in the layer above. - Medium —
HandleSubscribeCompleted's success branch never checks_unresolvedTags, so a tag that previously failed resolution (incrementing_totalSubscribed) and is then successfully subscribed by a different instance gets_totalSubscribed++a second time, double-counting; meanwhile the unresolved entry lingers until the retry timer also resolves it, creating an orphaned monitored item. - Medium — when an instance is unsubscribed mid-flight,
HandleSubscribeCompletedre-creates an empty_subscriptionsByInstance[name]entry and processes the late results, leaking_tagSubscriberCount/_totalSubscribed/_resolvedTagsincrements for an instance with no_subscribersentry to deliver values to. - Medium —
HandleSubscribeCompletedcallsTimers.StartPeriodicTimeron every completed subscribe with unresolved tags; in Akka.NET,StartPeriodicTimerwith the same key cancels and replaces the existing timer, so a burst of subscribes arriving faster thanTagResolutionRetryInterval(10 s default) keeps resetting the timer and the retry never actually fires.
Re-review 2026-05-17 (commit 39d737e)
All 13 findings from the 2026-05-16 review remain Resolved and the fixes were
verified in place against the current source (PipeTo(Self) subscribe pattern,
Resume supervision, ConcurrentDictionary callback maps, atomic disconnect guards,
bounded write timeout, etc.). The re-review walked all 10 checklist categories again
and found 4 new findings: one High — the DCL-012 security warning is never
seen in production because RealOpcUaClientFactory.Create() constructs
RealOpcUaClient with no logger, so the warning sinks into NullLogger; one
Medium — initial-connect failures in the Connecting state never count toward
failover, so a connection whose primary endpoint is unreachable at startup retries the
primary forever and never tries the configured backup; one Medium —
HandleSubscribeCompleted always replies SubscribeTagsResponse(success: true) even
when a connection-level subscribe failure is driving the actor into Reconnecting,
telling the Instance Actor the subscribe succeeded when it did not; and one Low —
WriteBatchAsync does not catch the InvalidOperationException from EnsureConnected,
so a mid-batch disconnect aborts the whole write batch (the same class of defect
DCL-007 fixed for ReadBatchAsync). New findings are numbered from
DataConnectionLayer-014.
Checklist coverage (2026-05-28 re-review, commit 1eb6e97)
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | x | Findings 020 (double-count _totalSubscribed when a previously-unresolved tag is resolved by a different instance) and 021 (leaked _subscriptionsByInstance entry + counter increments when instance unsubscribes mid-flight). |
| 2 | Akka.NET conventions | x | Finding 022 — Timers.StartPeriodicTimer reset on every HandleSubscribeCompleted for unresolved tags can stall the retry timer indefinitely under a subscribe burst. |
| 3 | Concurrency & thread safety | x | Finding 018 — concurrent subscribes for the same tag from different instances each spawn an adapter subscription and the second handle is orphaned. Finding 019 — OpcUaDataConnection._subscriptionHandles is a plain Dictionary<,> mutated from thread-pool continuations (same class of bug as DCL-003 one layer above). |
| 4 | Error handling & resilience | x | No new issues; DCL-004 / DCL-007 / DCL-015 / DCL-017 fixes verified in place. |
| 5 | Security | x | No new issues; DCL-012 / DCL-014 fixes verified. The Commons-side OpcUaEndpointConfig.AutoAcceptUntrustedCerts = true default surfaced in DCL-012 is still present but is out of this module's scope. |
| 6 | Performance & resource management | x | No new issues; DCL-008 reverse index verified. (Finding 018's orphaned adapter handle is logged under concurrency.) |
| 7 | Design-document adherence | x | No new issues. DCL-009's design-doc action (document unstable-disconnect failover trigger + configurable threshold) is still open at the doc level but out of this module's scope. |
| 8 | Code organization & conventions | x | No issues — POCOs in Commons, options class owned by component, factory + DI registration consistent. |
| 9 | Testing coverage | x | DCL001–017 regression tests present. Gaps remain for finding 018 (concurrent subscribe of same tag from two instances), 019 (concurrent _subscriptionHandles mutation), 020 (resolve-via-different-instance), 021 (unsubscribe-mid-flight), 022 (timer-reset starvation). |
| 10 | Documentation & comments | x | No new issues; DCL-013 atomic-guard XML comments verified. |
Checklist coverage (2026-05-17 re-review, commit 39d737e)
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | x | 2026-05-16 findings resolved. Re-review: finding 016 — SubscribeTagsResponse reports success on a connection-level subscribe failure. |
| 2 | Akka.NET conventions | x | 2026-05-16 findings resolved (PipeTo(Self) subscribe, Resume supervision). Re-review: no new issues. |
| 3 | Concurrency & thread safety | x | 2026-05-16 findings resolved (ConcurrentDictionary, atomic disconnect guards). Re-review: no new issues. |
| 4 | Error handling & resilience | x | Re-review: finding 015 — initial-connect failures never trigger failover; finding 017 — WriteBatchAsync aborts on mid-batch disconnect. |
| 5 | Security | x | Re-review: finding 014 — the DCL-012 auto-accept-cert warning is never logged in production (RealOpcUaClient built without a logger). |
| 6 | Performance & resource management | x | 2026-05-16 finding 008 resolved (reverse index). Re-review: no new issues. |
| 7 | Design-document adherence | x | 2026-05-16 findings 005/009 resolved. Re-review: no new issues (finding 015 logged under resilience). |
| 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. |
| 9 | Testing coverage | x | DCL001–013 regression tests present. Re-review: gaps remain for finding 014/015/016 scenarios (no test for production logger wiring, startup failover, or subscribe-response-on-failure). |
| 10 | Documentation & comments | x | 2026-05-16 finding 013 resolved. Re-review: no new issues. |
Findings
DataConnectionLayer-001 — Task.Run in HandleSubscribe mutates actor state off the actor thread
| Severity | Critical |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:473-538 |
Description
HandleSubscribe launches a Task.Run(async () => ...) that runs on a thread-pool
thread and directly mutates the actor's private mutable state: instanceTags (a
reference into _subscriptionsByInstance), _subscriptionIds, _totalSubscribed,
_resolvedTags, and _unresolvedTags. All of these are simultaneously read and
written by the actor's own message loop (HandleTagValueReceived, HandleUnsubscribe,
ReSubscribeAll, HandleRetryTagResolution, ReplyWithHealthReport). This is a
direct violation of the Akka.NET actor model, which guarantees single-threaded access
to actor state only when state is touched on the actor thread. Two concurrent
subscribe requests, or a subscribe overlapping a TagValueReceived / GetHealthReport,
produce data races on Dictionary/HashSet/int — Dictionary is not thread-safe
and concurrent mutation can corrupt internal buckets, throw, or lose entries. It can
also produce torn reads of the health counters.
Recommendation
Do not mutate actor state from the background task. Perform only the await _adapter.SubscribeAsync(...) / ReadAsync(...) I/O in the task, collect the results
into a local immutable result object, and PipeTo(Self) an internal message (e.g.
SubscribeCompleted) whose handler — running on the actor thread — applies all state
mutations and counter updates. The response to Sender should be sent from that
handler too.
Resolution
Resolved 2026-05-16. HandleSubscribe was restructured to follow the actor's own
PipeTo(Self) pattern (the one already used by HandleRetryTagResolution): the
background Task.Run now performs only adapter I/O (SubscribeAsync/ReadAsync),
collects per-tag outcomes into an immutable SubscribeCompleted message, and pipes
that to Self. All mutation of _subscriptionIds, _subscriptionsByInstance,
_totalSubscribed, _resolvedTags and _unresolvedTags now happens in the new
HandleSubscribeCompleted handler on the actor thread; it is wired into the
Connected, Connecting and Reconnecting states so an in-flight subscribe is applied
regardless of state transitions. Regression test
DCL001_ConcurrentSubscribes_DoNotCorruptSubscriptionCounters (30×30 concurrent
subscribes) fails against the pre-fix code and passes after. Fixed by the commit
whose message references DataConnectionLayer-001.
DataConnectionLayer-002 — Restart supervision discards all subscription state on connection-actor crash
| Severity | High |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionManagerActor.cs:131-141 |
Description
DataConnectionManagerActor.SupervisorStrategy returns a OneForOneStrategy with
Directive.Restart for DataConnectionActor failures. On restart, Akka.NET creates a
fresh actor instance, so all in-memory fields — _subscriptionsByInstance,
_subscriptionIds, _subscribers, _unresolvedTags, the quality counters — are
silently discarded. The actor re-enters Connecting with zero subscriptions, and the
design doc's "transparent re-subscribe" guarantee (WP-10) is broken: Instance Actors
that had subscribed before the crash never get their tags re-subscribed and will sit
at uncertain/stale quality indefinitely with no error returned. There is no durable
subscription store from which a restarted actor could rebuild state.
Recommendation
Either (a) make the subscription registry durable/recoverable so a restarted actor
can rebuild it (persist to local SQLite as the design doc says connection definitions
are, and have PreStart reload subscriptions), or (b) treat a connection-actor crash
as a lifecycle event the DataConnectionManagerActor notices, so it can re-issue the
subscription registrations. At minimum document that subscribers must re-register
after a crash and surface the lost-state condition rather than failing silently.
Resolution
Resolved 2026-05-16. The DataConnectionManagerActor.SupervisorStrategy was changed
from Directive.Restart to Directive.Resume for DataConnectionActor failures.
Resume keeps the existing actor instance and all its in-memory subscription state
(_subscriptionsByInstance, _subscriptionIds, _subscribers, quality counters)
intact across a transient handler exception, so the design doc's "transparent
re-subscribe" guarantee (WP-10) is preserved. The actor is a long-lived stateful
coordinator and its own Become/Stash reconnect state machine already recovers
connection-level faults — it does not need a restart. This also aligns with the
ScadaBridge convention of Resume for coordinator actors. Regression test
DCL002_ConnectionActorCrash_PreservesSubscriptionState crashes the connection actor
via a synchronously-throwing write and asserts the subscription survives (health
report still shows 1 subscribed/resolved tag); it fails against the pre-fix Restart
code and passes after. Fixed by the commit whose message references
DataConnectionLayer-002 (commit <pending>).
DataConnectionLayer-003 — RealOpcUaClient callback/monitored-item dictionaries mutated without synchronization
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/RealOpcUaClient.cs:16-17,130-131,153,163,173,183-184 |
Description
_monitoredItems and _callbacks are plain Dictionary<,> instances. They are
written from CreateSubscriptionAsync / RemoveSubscriptionAsync (invoked from the
DataConnectionActor's Task.Run / ContinueWith continuations, i.e. thread-pool
threads) and from DisconnectAsync (.Clear()), while being read concurrently from
the OPC Foundation SDK's MonitoredItem.Notification event handler, which fires on
the SDK's internal publish threads (_callbacks.TryGetValue(handle, ...) at line
163). Concurrent reads during a Dictionary resize or Clear() are undefined
behaviour — they can throw InvalidOperationException, return wrong entries, or
corrupt the dictionary. The DataConnectionActor's subscribe path already runs off
the actor thread (finding 001), so multiple subscribe calls can also race each other
here.
Recommendation
Use ConcurrentDictionary<,> for _monitoredItems and _callbacks, or guard all
access with a lock. Note that fixing finding 001 (serialising subscribe through the
actor thread) reduces but does not eliminate the race, because the SDK notification
threads still read _callbacks concurrently with RemoveSubscriptionAsync /
DisconnectAsync.
Resolution
Resolved 2026-05-16. _monitoredItems and _callbacks in RealOpcUaClient were
changed from plain Dictionary<,> to ConcurrentDictionary<,>, and the two
Remove(key) call sites switched to TryRemove. This makes the maps safe to read
from the OPC Foundation SDK's publish threads (MonitoredItem.Notification reading
_callbacks) concurrently with subscribe/disconnect mutations on other threads.
RealOpcUaClient wraps concrete OPC Foundation SDK types (ISession,
Subscription, MonitoredItem) and cannot be exercised without a live OPC UA
server, so the regression is guarded structurally by
DCL003_SharedDictionaryFields_AreConcurrentCollections (a reflection test asserting
both fields are ConcurrentDictionary<,>); it fails against the pre-fix Dictionary
code and passes after. Fixed by the commit whose message references
DataConnectionLayer-003 (commit <pending>).
DataConnectionLayer-004 — Subscribe-time tag-resolution failure leaves the connection healthy but never recovers correctly
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:495-503,529-537 |
Description
When _adapter.SubscribeAsync throws inside the HandleSubscribe background task,
the catch block adds the tag to _unresolvedTags and increments _totalSubscribed,
treating every subscribe exception as a tag-resolution failure. But SubscribeAsync
also throws InvalidOperationException from EnsureConnected() when the OPC UA
client is not connected, and throws on transport faults — these are connection
problems, not bad tag paths. They get misclassified as unresolved tags and retried on
the 10 s tag-resolution timer instead of triggering the reconnection state machine.
Worse, the design doc (Tag Path Resolution, step 2) says the failed tag's attribute
must be marked quality bad; the code never pushes a bad-quality update to the
subscriber for a tag that fails to resolve at subscribe time, so the Instance Actor
stays at uncertain quality with no signal. The TagResolutionFailed message it sends
to Self only logs and re-arms the timer (HandleTagResolutionFailed).
Recommendation
Distinguish connection-level exceptions (raise AdapterDisconnected / let the
reconnect machine handle them) from genuine node-not-found errors. For genuine
resolution failures, push a TagValueUpdate with QualityCode.Bad to the subscribing
Instance Actor so it reflects the documented behaviour.
Resolution
Resolved 2026-05-16. The subscribe background task now classifies each subscribe
exception via the new IsConnectionLevelFailure helper (InvalidOperationException
— thrown by EnsureConnected() — plus SocketException/TimeoutException/
IOException count as connection-level; anything else is a genuine resolution
failure). The classification is carried on SubscribeTagResult.ConnectionLevelFailure
and applied on the actor thread in HandleSubscribeCompleted: connection-level
failures no longer become unresolved tags and instead drive the reconnection state
machine (HandleSubscribeCompleted returns a flag and the Connected-state handler
calls BecomeReconnecting); genuine resolution failures still go to _unresolvedTags
and the retry timer, and now also push a TagValueUpdate with QualityCode.Bad to
the subscribing Instance Actor, matching the design doc's Tag Path Resolution step 2.
Regression tests DCL004_GenuineTagResolutionFailure_PushesBadQualityToSubscriber
and DCL004_ConnectionLevelSubscribeFailure_TriggersReconnect_NotTagRetry both fail
against the pre-fix code and pass after. Fixed by the commit whose message references
DataConnectionLayer-004 (commit <pending>).
DataConnectionLayer-005 — WriteTimeout option is documented and configured but never applied
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/DataConnectionOptions.cs:15, src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:573-590 |
Description
DataConnectionOptions.WriteTimeout (default 30 s) and the design doc's "Shared
Settings" table both promise a bounded timeout for synchronous device writes. The
value is never read anywhere in the module (grep confirms only the declaration).
HandleWrite calls _adapter.WriteAsync(request.TagPath, request.Value) with no
CancellationToken and no timeout. If the OPC UA server hangs (TCP black-hole, no
RST), the write Task never completes, PipeTo(sender) never fires, and the calling
script's Ask blocks until its own ask-timeout — and the script gets no DCL-level
error. The design states write failures (including timeout) must be returned
synchronously to the script; an unbounded write violates that.
Recommendation
Create a CancellationTokenSource(_options.WriteTimeout), pass its token to
WriteAsync, and in the continuation translate cancellation into a failed
WriteTagResponse with a timeout error message. Apply the same to the read used by
the initial-value seed and to WriteBatchAndWaitAsync paths if they are reachable.
Resolution
Resolved 2026-05-16. HandleWrite now creates a CancellationTokenSource(_options.WriteTimeout),
passes its token to _adapter.WriteAsync(...), and disposes the source in the
continuation. A cancelled/timed-out write (Task.IsCanceled or a base
OperationCanceledException) is translated into a failed WriteTagResponse with a
"Write timeout after Ns" message, so a hung device write is bounded and the failure
is returned synchronously to the calling script (WP-11) instead of blocking until the
script's own Ask-timeout. (The WriteBatchAndWaitAsync adapter path already accepts
an explicit timeout/CancellationToken and is not invoked by HandleWrite, so no
change was needed there.) Regression test
DCL005_Write_ThatHangs_TimesOutAndReturnsFailureSynchronously uses an adapter whose
WriteAsync only completes when its token fires; it fails against the pre-fix
unbounded code and passes after. Fixed by the commit whose message references
DataConnectionLayer-005 (commit <pending>).
DataConnectionLayer-006 — Health quality counters not reset/recomputed after failover or re-subscribe
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:645-673,721-756 |
Description
ReSubscribeAll resets _subscriptionIds, _unresolvedTags and _resolvedTags to a
clean slate, but leaves _lastTagQuality, _tagsGoodQuality, _tagsBadQuality and
_tagsUncertainQuality untouched. PushBadQualityForAllTags (called on disconnect)
sets _tagsBadQuality = _lastTagQuality.Count and zeroes the others. After a
reconnect, HandleTagValueReceived decrements the old bucket using
_lastTagQuality's value and increments the new one — but tags resolved for the first
time after reconnect were never in _lastTagQuality, so they only increment, never
decrement, and the totals can drift above _totalSubscribed. Over repeated
disconnect/reconnect cycles the health report's good/bad/uncertain counts become
unreliable.
Verification note: Confirmed against source. The root cause is broader than the
reconnect path the finding describes: HandleUnsubscribe also never removes a tag
from _lastTagQuality nor decrements its quality bucket, so an unsubscribed tag
lingers and PushBadQualityForAllTags (which sets _tagsBadQuality = _lastTagQuality.Count) over-counts it — driving the bad-quality count above
_totalSubscribed even without a re-subscribe. Both the unsubscribe leak and the
re-subscribe drift are real.
Recommendation
On BecomeConnected after a re-subscribe (or in ReSubscribeAll), clear
_lastTagQuality and the three quality counters and let them be repopulated from
fresh TagValueReceived messages. Alternatively recompute the buckets from
_lastTagQuality whenever it changes rather than maintaining incremental counters.
Resolution
Resolved 2026-05-16 (commit pending). HandleUnsubscribe now removes each
unsubscribed tag from _lastTagQuality and decrements the corresponding quality
bucket, then reports the corrected counters via UpdateTagQuality/UpdateTagResolution;
ReSubscribeAll clears _lastTagQuality and zeroes the three quality counters so
post-reconnect tags are repopulated from fresh TagValueReceived messages instead of
only incrementing. Regression test
DCL006_DisconnectAfterUnsubscribe_BadQualityCountMatchesRemainingTags subscribes two
tags, pushes Good values, unsubscribes one, then disconnects and asserts
PushBadQualityForAllTags reports exactly 1 bad tag (the reconnect is gated open so
ReSubscribeAll does not run before the assertion); it reports 2 against the pre-fix
code and 1 after.
DataConnectionLayer-007 — ReadBatchAsync aborts the whole batch on the first failing tag
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:187-195 |
Description
ReadBatchAsync loops calling ReadAsync per tag. ReadAsync re-throws any
non-cancellation exception (line 184). So if any single tag in the batch throws (bad
node, transient fault), the entire ReadBatchAsync throws and the caller gets no
results for the tags that did read successfully — even though ReadResult already
has a Success/ErrorMessage shape designed to carry per-tag failures. The batch is
also fully serial (one round-trip per tag), defeating the point of a batch API; the
design doc lists ReadBatch/WriteBatch as first-class operations.
Verification note: Confirmed against source — ReadAsync re-throws on any
non-OperationCanceledException, aborting the whole batch.
Recommendation
Catch per-tag exceptions inside the loop and store a failed ReadResult for that tag
so the batch returns a complete map. Ideally issue a single OPC UA Read service call
for all node IDs (RealOpcUaClient.ReadValueAsync already builds a
ReadValueIdCollection — extend it to accept multiple nodes).
Resolution
Resolved 2026-05-16 (commit pending). ReadBatchAsync now wraps each per-tag
ReadAsync in a try/catch: a per-tag exception is recorded as a failed ReadResult
(Success: false, message = the exception message) so the batch returns a complete
result map for every requested tag; OperationCanceledException is still propagated
so a cancelled batch aborts as a whole. The per-tag-serial loop and single-service-call
optimisation were deliberately left for a follow-up — they are a performance concern,
not the correctness bug this finding raised. Regression test
DCL007_ReadBatch_ReturnsPerTagResults_WhenOneTagFails reads three tags where the
middle one throws and asserts all three appear in the result map with the failing one
marked unsuccessful; it threw (no map returned) against the pre-fix code and passes
after.
DataConnectionLayer-008 — HandleUnsubscribe is O(n^2) over instances and rechecks _unresolvedTags redundantly
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:540-569 |
Description
For each tag of the instance being removed, HandleUnsubscribe scans every other
instance's tag set (_subscriptionsByInstance.Where(...).Any()), making the operation
O(tags x instances). On a site with many instances sharing a connection this is
needlessly expensive on every instance stop/redeploy. Separately, line 562
re-evaluates !_unresolvedTags.Contains(tagPath) immediately after line 561 already
removed tagPath from _unresolvedTags, so the condition is always true — dead
logic that obscures intent (the decrement of _resolvedTags is unconditional in
practice).
Recommendation
Maintain a reference count per tag path (or a tagPath -> set<instance> reverse index)
so the "any other subscriber" check is O(1). Remove the redundant _unresolvedTags
re-check or restructure so the resolved/unresolved decrement reflects the tag's actual
prior state captured before removal.
Resolution
Resolved 2026-05-16 (commit pending). A tagPath → subscriber-count reverse index
(_tagSubscriberCount) was added: HandleSubscribeCompleted increments it whenever a
tag is newly added to an instance's set, and HandleUnsubscribe decrements it,
releasing a tag at the adapter only when the count reaches zero. The "any other
subscriber" check is now O(1) per tag instead of an O(instances) Where(...).Any()
scan. The redundant !_unresolvedTags.Contains(tagPath) re-check (always true after
the unconditional Remove on the line above) was removed — the surviving branch is
entered only for tags that have a subscription id, which are by definition resolved,
so _resolvedTags-- is now unconditional with an explanatory comment. The cleanup
also fixed a latent leak the original code could not reach: an unresolved tag whose
last subscriber unsubscribes is now removed from _unresolvedTags/_resolutionInFlight
and decremented from _totalSubscribed (previously it lingered in the retry timer and
the subscribed total forever). Regression test
DCL008_Unsubscribe_OnlyReleasesTagWhenLastSubscriberLeaves subscribes a tag from two
instances plus an exclusive tag, then unsubscribes each instance and asserts the
shared tag is released at the adapter only after the last subscriber leaves and the
health counters track correctly. (This finding is a performance refactor, not a
correctness bug — the pre-fix Where(...).Any() logic was functionally correct, so
the test passes against both versions and serves as a behavioural guard for the
refactor.)
DataConnectionLayer-009 — Implemented failover heuristic diverges from the documented state machine
| Severity | Medium — partially design-doc work outside this module's editable scope |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:189,242-297,379-449, docs/requirements/Component-DataConnectionLayer.md:73-85 |
Description
The design doc's failover state machine reads "retry active endpoint (5s) -> N failures
(>= FailoverRetryCount) -> switch to other endpoint". The code implements two separate
failover triggers: (a) HandleReconnectResult counts _consecutiveFailures on
connect-attempt failures (matches the doc), and (b) BecomeReconnecting additionally
counts _consecutiveUnstableDisconnects — connections that succeeded but dropped
within a hard-coded 60 s StableConnectionThreshold — and fails over on that count
too. The unstable-disconnect path, the 60 s threshold, and the fact that failover can
happen on successful-but-flaky connections are not described in the component doc at
all. A reviewer or operator reading Component-DataConnectionLayer.md would not
predict this behaviour, and the 60 s threshold is a magic constant not exposed via
DataConnectionOptions.
Verification note: Confirmed against source. The hard-coded
StableConnectionThreshold = TimeSpan.FromSeconds(60) static readonly field and the
_consecutiveUnstableDisconnects failover path both exist as described.
Recommendation
Update Component-DataConnectionLayer.md to document the unstable-disconnect failover
path and the stability threshold, and move the 60 s threshold into
DataConnectionOptions so it is configurable and consistent with the other tunables.
Resolution
Resolved 2026-05-16 (commit pending). The configurability half of the recommendation
is done: the hard-coded StableConnectionThreshold constant was removed from
DataConnectionActor and replaced with a new DataConnectionOptions.StableConnectionThreshold
property (60 s default), bindable from the DataConnectionLayer appsettings.json
section like ReconnectInterval/TagResolutionRetryInterval/WriteTimeout. Regression
test DCL009_StableConnectionThreshold_IsConfigurable_WithSixtySecondDefault guards
the default and the setter. The documentation half is out of this module's editable
scope — docs/requirements/Component-DataConnectionLayer.md (lines 73-85) still
describes only the connect-failure failover path and does not mention the
unstable-disconnect trigger. Action required (surfaced): the DCL design doc should
be updated to document the unstable-disconnect failover path and the configurable
stability threshold; that edit was deliberately not made here because this task is
scoped to src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer, tests, and this findings file only.
DataConnectionLayer-010 — Tag-resolution retry can issue duplicate concurrent subscribe attempts
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:594-619,689-703 |
Description
HandleRetryTagResolution fires SubscribeAsync for every tag in _unresolvedTags
via ContinueWith(...).PipeTo(self), but does not remove the tags from
_unresolvedTags while the attempts are in flight. Because tags are not removed
before the retry, a slow SubscribeAsync overlapping the next 10 s tick issues
duplicate concurrent subscribe attempts for the same tag, which can create duplicate
monitored items / leaked subscription IDs (the second success overwrites
_subscriptionIds[tag] in HandleTagResolutionSucceeded, orphaning the first handle
with no UnsubscribeAsync call). The timer-cancel condition in
HandleTagResolutionSucceeded is also non-deterministic for the same reason.
Verification note: Confirmed against source — HandleRetryTagResolution dispatched
SubscribeAsync for every tag in _unresolvedTags on every tick with no in-flight
guard.
Recommendation
Remove tags from _unresolvedTags (into an "in-flight" set) when a retry is
dispatched, and only put them back on failure. This prevents overlapping duplicate
subscribe attempts and makes the timer-cancel condition deterministic.
Resolution
Resolved 2026-05-16 (commit pending). A new _resolutionInFlight HashSet<string>
tracks tags whose retry SubscribeAsync is currently outstanding.
HandleRetryTagResolution now dispatches only for unresolved tags not already in
flight (and skips entirely if all are in flight), adding each dispatched tag to the
set; HandleTagResolutionSucceeded and HandleTagResolutionFailed remove the tag
from the set when its attempt completes, and HandleUnsubscribe/ReSubscribeAll
clear stale entries. This prevents overlapping duplicate subscribe attempts and the
resulting orphaned monitored items. Regression test
DCL010_TagResolutionRetry_DoesNotIssueDuplicateConcurrentSubscribes gives a tag a
genuine initial failure then a retry SubscribeAsync that never completes, lets six
100 ms retry ticks elapse, and asserts exactly one retry was dispatched (2 total
subscribe calls); the pre-fix code dispatched on every tick (6 total).
DataConnectionLayer-011 — Stale subscription callbacks from disposed adapters can still reach the actor
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:486-489,278-285,416-425, src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:252-262 |
Description
On failover the actor disposes the old adapter (_adapter.DisposeAsync(),
fire-and-forget) and creates a fresh one. The old adapter's subscription callbacks
captured self and tagPath and Tell TagValueReceived to the actor. While the
Reconnecting handler ignores TagValueReceived (line 334), once the actor reaches
Connected again it processes them — and a disposed adapter whose OPC UA SDK threads
have not yet fully torn down could still deliver a value, mixing pre-failover device
data with the new endpoint's data and briefly reporting a value the active endpoint
never produced. There is no per-adapter generation/epoch tag on TagValueReceived to
distinguish current from stale callbacks.
Verification note: Confirmed against source — TagValueReceived carried no
adapter identity, and HandleTagValueReceived (reachable in Connected) processed
any such message regardless of which adapter produced it.
Recommendation
Add an adapter-generation counter incremented on every adapter swap; stamp it onto
TagValueReceived (captured in the callback closure) and drop messages whose
generation does not match the current adapter in HandleTagValueReceived.
Resolution
Resolved 2026-05-16 (commit pending). Implemented exactly as recommended: a new
_adapterGeneration int field is incremented at both adapter-swap sites (the
unstable-disconnect failover in BecomeReconnecting and the connect-failure failover
in HandleReconnectResult). The TagValueReceived record gained an
AdapterGeneration field; every subscription callback closure (HandleSubscribe, the
initial-read seed, HandleRetryTagResolution, ReSubscribeAll) captures the
generation in effect at subscribe time and stamps it onto each TagValueReceived.
HandleTagValueReceived drops any message whose generation no longer matches the
current adapter, so a callback fired by a disposed adapter after failover cannot reach
an Instance Actor. Regression test
DCL011_StaleTagValueFromOldAdapter_IsNotForwardedAfterFailover subscribes on the
primary, fails over to the backup, then invokes the captured primary callback with a
stale value and asserts the subscriber receives nothing; the stale value reached the
subscriber against the pre-fix code and is dropped after.
DataConnectionLayer-012 — AutoAcceptUntrustedCerts defaults to true, accepting any server certificate
| Severity | Medium — full secure default also requires a Commons + design-doc change outside this module |
| Category | Security |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/IOpcUaClient.cs:17, src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/RealOpcUaClient.cs:49,60-61, docs/requirements/Component-DataConnectionLayer.md:116 |
Description
OpcUaConnectionOptions.AutoAcceptUntrustedCerts defaults to true, and
RealOpcUaClient.ConnectAsync wires CertificateValidator.CertificateValidation += (_, e) => e.Accept = true
when it is set. With the default, every server certificate is accepted unconditionally
— there is no certificate-pinning or trust-store enforcement — which defeats the
Sign/SignAndEncrypt security modes against an active man-in-the-middle on the OPC
UA link. The design doc explicitly lists true as the default. For an industrial
control link this is a meaningful exposure; a secure-by-default posture would reject
untrusted certs unless an operator opts in per connection.
Verification note: Confirmed against source. Note the authoritative runtime
default does not actually live on OpcUaConnectionOptions — for a real connection
OpcUaDataConnection.ConnectAsync builds OpcUaConnectionOptions from
OpcUaEndpointConfig (in ZB.MOM.WW.ScadaBridge.Commons), whose AutoAcceptUntrustedCerts
property also defaults to true. OpcUaConnectionOptions' own default is only the
fallback used when an OpcUaConnectionOptions is constructed directly.
Recommendation
Default AutoAcceptUntrustedCerts to false and require explicit per-connection
opt-in, or at minimum log a prominent warning whenever the auto-accept validator is
installed. Update the design doc to reflect the secure default.
Resolution
Resolved 2026-05-16 (commit pending). The two in-scope parts of the recommendation
are done: (1) OpcUaConnectionOptions.AutoAcceptUntrustedCerts now defaults to
false; (2) RealOpcUaClient.ConnectAsync logs a prominent ILogger warning
whenever the auto-accept certificate validator is installed (an ILogger<RealOpcUaClient>
was added as an optional constructor parameter, defaulting to NullLogger, so
existing callers are unaffected). Regression test
DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse guards the
new secure default. Two parts remain outside this module's editable scope and are
surfaced as action required: (a) ZB.MOM.WW.ScadaBridge.Commons.Types.DataConnections.OpcUaEndpointConfig.AutoAcceptUntrustedCerts
still defaults to true — since that is the value actually used for a real connection
(see verification note above), the Commons default must also be flipped to false
for the system to be secure-by-default; (b) docs/requirements/Component-DataConnectionLayer.md
line 116 still documents true as the default and must be updated. Both edits were
deliberately not made here because this task is scoped to
src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer, tests, and this findings file only.
DataConnectionLayer-013 — Misleading XML comment: RaiseDisconnected claims thread safety it does not provide
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:270-281 |
Description
The XML doc on RaiseDisconnected states "Thread-safe: only the first caller triggers
the event." The implementation is a non-atomic check-then-set on a volatile bool
(if (_disconnectFired) return; _disconnectFired = true;). volatile guarantees
visibility, not atomicity — two threads (e.g. the OPC UA keep-alive thread via
OnClientConnectionLost and a ReadAsync failure path) can both observe
_disconnectFired == false and both invoke Disconnected. In practice the
DataConnectionActor tolerates a duplicate AdapterDisconnected message, so impact
is low, but the comment overstates the guarantee. The same pattern exists in
RealOpcUaClient.OnSessionKeepAlive (_connectionLostFired).
Recommendation
Either make the guard atomic (Interlocked.Exchange with an int flag, or a lock),
or correct the comment to say "best-effort once-only; a duplicate event is possible
under a race and is tolerated downstream."
Resolution
Resolved 2026-05-16 (commit pending). Rather than weaken the XML comment to match the
weak guard, the guard was made genuinely atomic so the documented "only the first
caller fires the event" guarantee becomes true. OpcUaDataConnection._disconnectFired
and RealOpcUaClient._connectionLostFired were changed from volatile bool to int,
and the check-then-set in RaiseDisconnected / OnSessionKeepAlive replaced with a
single Interlocked.Exchange(ref flag, 1) != 0 compare-and-set; the reset on connect
uses Interlocked.Exchange(ref flag, 0). The XML comments on both methods were updated
to describe the atomic compare-and-set explicitly. Regression test
DCL013_ConcurrentConnectionLost_RaisesDisconnectedExactlyOnce runs 25 rounds, each
fanning 32 barrier-synchronised threads that raise the client's ConnectionLost event
simultaneously, and asserts Disconnected fires exactly once per round; against a
non-atomic check-then-set it double-fires (verified by temporarily reverting the
guard), and it passes against the atomic fix.
DataConnectionLayer-014 — DCL-012 security warning is never logged in production: RealOpcUaClient is created without a logger
| Severity | High |
| Category | Security |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/RealOpcUaClient.cs:325, src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/RealOpcUaClient.cs:35-39,79-83 |
Description
Finding DataConnectionLayer-012 was resolved in part by adding a prominent
ILogger warning in RealOpcUaClient.ConnectAsync whenever the auto-accept
certificate validator is installed (RealOpcUaClient.cs:79-83). The
ILogger<RealOpcUaClient> constructor parameter was made optional, defaulting to
NullLogger<RealOpcUaClient>.Instance (RealOpcUaClient.cs:35-39).
However, the only production code path that constructs a RealOpcUaClient is
RealOpcUaClientFactory.Create() (RealOpcUaClient.cs:325), which calls
new RealOpcUaClient(_globalOptions) and passes no logger. The factory itself
holds only an OpcUaGlobalOptions and has no ILoggerFactory/ILogger available.
As a result the _logger field is always NullLogger for every real OPC UA
connection, and the man-in-the-middle warning the DCL-012 fix added is silently
discarded. An operator who deploys a connection with AutoAcceptUntrustedCerts
enabled — accepting any server certificate on an industrial control link — gets no
visible signal anywhere in the logs. The in-scope half of DCL-012's resolution is
therefore not actually effective in production; only the unit test
(DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse, which only
checks the default value) passes.
Recommendation
Thread a real logger through to RealOpcUaClient. DataConnectionFactory already
holds an ILoggerFactory and constructs RealOpcUaClientFactory(globalOptions) —
give RealOpcUaClientFactory an ILoggerFactory (or an ILogger<RealOpcUaClient>)
constructor parameter and pass _loggerFactory.CreateLogger<RealOpcUaClient>() into
each new RealOpcUaClient(...). Add a test that asserts the warning is emitted on a
real connect with auto-accept enabled (e.g. via a captured ILogger), not just that
the default is false.
Resolution
Resolved 2026-05-17 (commit pending). RealOpcUaClientFactory gained an
ILoggerFactory constructor parameter and Create() now passes
_loggerFactory.CreateLogger<RealOpcUaClient>() into every RealOpcUaClient it builds;
DataConnectionFactory (which already holds an ILoggerFactory) now constructs
RealOpcUaClientFactory(globalOptions, _loggerFactory), so the DCL-012 auto-accept-cert
warning reaches a real logger in production instead of being discarded by NullLogger.
The parameterless / single-arg factory constructors still default to NullLoggerFactory
so existing callers are unaffected. Regression tests
DCL014_RealOpcUaClientFactory_CreatesClientWithRealLogger and
DCL014_DataConnectionFactory_ThreadsLoggerToRealOpcUaClient fail against the pre-fix
code (the first fails to compile — no logger ctor; the second observes a NullLogger)
and pass after.
DataConnectionLayer-015 — Initial-connect failures never trigger failover; an unreachable primary at startup never tries the backup
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:404-417, src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:419-493 |
Description
Failover between the primary and backup endpoints is implemented in two places, both
reachable only after a connection has already been Connected at least once:
HandleReconnectResult (Reconnecting state) counts _consecutiveFailures and switches
endpoint, and BecomeReconnecting counts _consecutiveUnstableDisconnects.
HandleConnectResult — the handler for the initial connection attempt in the
Connecting state (DataConnectionActor.cs:404-417) — does neither. On failure it
only logs and re-arms the reconnect timer with AttemptConnect; it never increments
_consecutiveFailures, never consults _backupConfig, and never switches endpoint.
Consequence: if the primary endpoint is unreachable when the connection actor first
starts — which is the common case after a fresh artifact deployment, a site restart,
or a primary that is simply down at that moment — the actor retries the primary
endpoint indefinitely at ReconnectInterval and never attempts the configured
backup. The design doc's endpoint-redundancy promise ("automatic failover when the
active endpoint becomes unreachable") is silently not honoured for the
never-connected-yet case, and an operator sees a connection stuck Connecting forever
despite a healthy backup being configured.
Recommendation
Make HandleConnectResult participate in the failover counter the same way
HandleReconnectResult does: increment _consecutiveFailures on failure and, when
_backupConfig != null && _consecutiveFailures >= _failoverRetryCount, perform the
endpoint switch (dispose adapter, create the other adapter, bump _adapterGeneration,
log the failover event) before re-arming the timer. Alternatively, fold the initial
connect into the same reconnect path so there is a single failover decision point. Add
a regression test for "primary down at startup, backup configured → fails over to
backup".
Resolution
Resolved 2026-05-17 (commit pending). The endpoint-switch failover logic was extracted
from HandleReconnectResult into a shared CountFailureAndMaybeFailover helper, and
HandleConnectResult (the initial-connect handler in the Connecting state) now
increments _consecutiveFailures and calls that helper on failure — so a primary that
is unreachable at startup fails over to the configured backup after
FailoverRetryCount attempts instead of retrying the primary forever. Single-endpoint
connections (no backup) still retry indefinitely. Regression test
DCL015_PrimaryDownAtStartup_FailsOverToBackup fails against the pre-fix code (the
backup adapter is never created) and passes after; DCL015_SingleEndpointDownAtStartup_RetriesIndefinitely_NoFailover
guards the no-backup path.
DataConnectionLayer-016 — HandleSubscribeCompleted reports SubscribeTagsResponse success even on a connection-level subscribe failure
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:606,666-672, src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:232-240 |
Description
HandleSubscribeCompleted computes connectionLevelFailure (line 606) and returns it
so the Connected-state handler can drive the actor into Reconnecting
(DataConnectionActor.cs:232-240). But before returning, it unconditionally replies
to the caller with new SubscribeTagsResponse(..., true, null, ...) (lines 666-667) —
Success: true, Error: null — regardless of whether any tag failed at connection
level.
So when a subscribe arrives while the adapter is silently down, the Instance Actor is
told the subscribe succeeded, while the connection actor simultaneously transitions
to Reconnecting. The tags were never actually subscribed at the adapter (the catch
block recorded Success: false); they are recovered later by ReSubscribeAll only if
and when reconnection succeeds. The caller has no way to distinguish "subscribed and
healthy" from "accepted, but the connection is currently down" — a misleading
success signal on a request that did not do what the response claims.
(Genuine tag-resolution failures are arguably also reported as overall true, but
that is defensible: those tags are tracked in _unresolvedTags and the design models
unresolved tags as a runtime quality concern, with a Bad-quality TagValueUpdate
already pushed. The connection-level case is the clear defect because the actor itself
treats it as a failure worth a state transition.)
Recommendation
When connectionLevelFailure is true, reply with
SubscribeTagsResponse(..., success: false, error: "connection unavailable — will re-subscribe on reconnect", ...) (or an equivalent), so the caller's response matches
the actor's own assessment. Optionally carry per-tag outcomes in the response so the
Instance Actor can reflect partial success. Add a test asserting the response is not
Success: true when a connection-level subscribe failure drives Reconnecting.
Resolution
Resolved 2026-05-17 (commit pending). HandleSubscribeCompleted now replies
SubscribeTagsResponse(success: false, error: "connection unavailable — will re-subscribe on reconnect") when connectionLevelFailure is true — matching the
actor's own assessment as it drives into Reconnecting — instead of unconditionally
replying success: true. Genuine tag-resolution failures still reply success: true
(they are a runtime quality concern tracked via _unresolvedTags, with a Bad-quality
TagValueUpdate already pushed), so that case is not regressed. Regression test
DCL016_ConnectionLevelSubscribeFailure_RepliesWithUnsuccessfulResponse fails against
the pre-fix code (it returned Success: true) and passes after;
DCL016_GenuineResolutionFailure_StillRepliesSuccess guards the resolution-failure path.
DataConnectionLayer-017 — WriteBatchAsync aborts the whole batch on a mid-batch disconnect
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:229-237, src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:218-227 |
Description
WriteBatchAsync loops calling WriteAsync per tag (OpcUaDataConnection.cs:229-237).
WriteAsync returns a WriteResult for OPC-UA-level write rejections (good — a bad
status does not abort the batch), but it first calls EnsureConnected()
(OpcUaDataConnection.cs:220), which throws InvalidOperationException when the
client is disconnected. WriteBatchAsync does not catch that exception, so if the
connection drops partway through a batch the whole WriteBatchAsync throws and the
caller gets no result map — losing the per-tag outcomes for the tags that already
wrote. This is the same class of defect that DataConnectionLayer-007 fixed for
ReadBatchAsync (which now records a failed ReadResult per failing tag and only
propagates OperationCanceledException). WriteBatchAsync feeds
WriteBatchAndWaitAsync (line 246), so a disconnect during a flag-and-wait write
sequence surfaces as an unhandled exception rather than a clean false/per-tag result.
Severity is Low because device writes are real-time control operations with no
store-and-forward, the batch write paths are not on the primary HandleWrite hot path
(HandleWrite calls single-tag WriteAsync), and a disconnect mid-batch is itself an
error condition — but the inconsistent error shape (exception vs. per-tag result) is a
maintainability and correctness wart.
Recommendation
Mirror the DCL-007 fix: wrap the per-tag WriteAsync call in WriteBatchAsync in a
try/catch that records a failed WriteResult(false, ex.Message) for the failing tag
and continues, while still propagating OperationCanceledException to abort a
cancelled batch as a whole. This gives callers (including WriteBatchAndWaitAsync) a
complete, consistent result map.
Resolution
Resolved 2026-05-17 (commit pending). WriteBatchAsync now wraps each per-tag
WriteAsync call in a try/catch — mirroring the DCL-007 fix for ReadBatchAsync — so a
mid-batch fault (the InvalidOperationException from EnsureConnected on a dropped
connection) is recorded as a failed WriteResult(false, ex.Message) for that tag and
the batch returns a complete result map; OperationCanceledException is still
propagated so a cancelled batch aborts as a whole. Callers (including
WriteBatchAndWaitAsync) now get a consistent per-tag result shape instead of an
unhandled exception. Regression test
DCL017_WriteBatch_ReturnsPerTagResults_WhenConnectionDropsMidBatch fails against the
pre-fix code (the batch throws, no map returned) and passes after;
DCL017_WriteBatch_CancellationAbortsWholeBatch guards that cancellation still aborts.
DataConnectionLayer-018 — Concurrent subscribes for the same tag from different instances orphan an adapter subscription handle
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:557,564-594,653 |
Resolution — added a _subscribesInFlight HashSet mirroring the
existing _resolutionInFlight pattern. HandleSubscribe now partitions
each request's tags on the actor thread into "this request will
SubscribeAsync" vs. "already subscribed by us OR by another in-flight
request"; the second arrival sees the tag in _subscribesInFlight and
treats it as AlreadySubscribed: true without issuing a duplicate
adapter call. HandleSubscribeCompleted removes each
non-AlreadySubscribed result from the set; ReSubscribeAll clears it on
reconnect. Regression test
DCL018_ConcurrentSubscribes_SameTag_DifferentInstances_IssueOneAdapterSubscribe
parks the first subscribe in flight, fires a second for the same tag,
and asserts exactly one adapter SubscribeAsync call.
Description
HandleSubscribe snapshots _subscriptionIds.Keys into a local alreadySubscribed
set on the actor thread before dispatching the Task.Run that performs the adapter
I/O (line 557). The snapshot is the only basis on which the background task decides
whether to call _adapter.SubscribeAsync — and it is taken once, before the I/O
runs.
If two SubscribeTagsRequest messages arrive on the actor thread for different
instances that both reference the same tag path, both HandleSubscribe invocations
take a snapshot at a time when neither subscribe has completed, so alreadySubscribed
does not contain the shared tag in either snapshot. Both background tasks then call
_adapter.SubscribeAsync(tagPath, ...), the adapter creates two monitored items
and returns two distinct subscription ids, and each task pipes a SubscribeCompleted
back to the actor with AlreadySubscribed: false, Success: true.
HandleSubscribeCompleted for the first message takes the success branch and writes
_subscriptionIds[tagPath] = subId1. The second message arrives, hits the
"already in _subscriptionIds" guard at line 653 (_subscriptionIds.ContainsKey(...))
and continues — but result.SubscriptionId (the orphan handle for the second
adapter subscription) is silently discarded. The orphan monitored item stays alive in
the OPC UA session for the lifetime of the adapter, sending duplicate data-change
notifications (whose callbacks were stamped with the captured generation) into
HandleTagValueReceived for every value change. Across a deploy that creates many
instances sharing a few tags, this leaks N-1 monitored items per shared tag and
doubles/triples the per-tag publish traffic.
DCL-010 fixed an analogous duplicate-dispatch bug for the tag-resolution retry path
via _resolutionInFlight; the equivalent guard is missing on the user-subscribe
path.
Recommendation
Track in-flight subscribes the same way DCL-010 tracks in-flight retries: maintain a
HashSet<string> _subscribesInFlight and add tagPath to it on the actor thread
before the Task.Run dispatch, only for tags not already in
_subscriptionIds and not already in _subscribesInFlight. Tags that are already
in flight should produce a SubscribeTagResult(..., AlreadySubscribed: true, ...)
without touching the adapter. Remove from _subscribesInFlight in
HandleSubscribeCompleted once the result is applied. Add a regression test that
fans two simultaneous SubscribeTagsRequest messages for the same tag and asserts
exactly one _adapter.SubscribeAsync(tag, ...) call (and no orphan subscription id).
DataConnectionLayer-019 — OpcUaDataConnection._subscriptionHandles is a plain Dictionary<,> mutated from concurrent thread-pool continuations
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:31,167,177, src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:163-164 |
Resolution — deleted the dead _subscriptionHandles field outright.
Subscription bookkeeping lives in RealOpcUaClient._monitoredItems /
_callbacks (already ConcurrentDictionary per DCL-003) at the device
layer and DataConnectionActor._subscriptionIds at the actor layer;
the adapter had no live reader and the field was a latent
race-condition trap. Added structural regression test
DCL019_OpcUaDataConnection_HasNoNonConcurrentSharedDictionary that
reflects over the adapter's fields and fails if any plain
Dictionary<,> is reintroduced.
Description
OpcUaDataConnection._subscriptionHandles is declared as Dictionary<string, string>. It is mutated from:
SubscribeAsync(line 167):_subscriptionHandles[subscriptionId] = tagPath;after anawait _client!.CreateSubscriptionAsync(...)— i.e. the assignment executes on the continuation thread (a thread-pool thread).UnsubscribeAsync(line 177):_subscriptionHandles.Remove(subscriptionId);similarly after anawait.DisconnectAsyncindirectly via the underlying_client.DisconnectAsyncdoes not touch_subscriptionHandles, but multipleSubscribeAsync/UnsubscribeAsynccalls can run in parallel from the upper layer.
The DCL upper layer calls _adapter.SubscribeAsync from multiple places that all
run off the actor thread:
DataConnectionActor.HandleSubscribeinside itsTask.Run(multiple invocations can run in parallel — see DCL-018);HandleRetryTagResolutionissues_adapter.SubscribeAsyncfor every tag in_unresolvedTagsand pipes the continuation (each subscribe runs concurrently via the SDK's async machinery);ReSubscribeAlldoes the same after a reconnect.
So plain-Dictionary mutations occur on multiple thread-pool threads concurrently —
the exact pattern DCL-003 fixed by switching RealOpcUaClient._monitoredItems and
_callbacks to ConcurrentDictionary<,>. Plain Dictionary mutations during a
concurrent resize are undefined behaviour: they can throw
InvalidOperationException, corrupt the internal hash buckets, or lose entries.
This is _subscriptionHandles is currently dead state (the dictionary is written to
and Removed but never read), so a corruption today would not crash the
subscribe path — but the bug is latent and the field will become load-bearing the
moment any code reads it (e.g., to expose a subscription-id-to-tag-path lookup for
diagnostics, which is what the dictionary's name suggests it was intended for).
Recommendation
Either (a) change _subscriptionHandles to
ConcurrentDictionary<string, string> and use TryAdd / TryRemove, mirroring
DCL-003's fix one layer down, or (b) delete the field entirely since it is never
read — the bookkeeping is fully owned by RealOpcUaClient._monitoredItems /
_callbacks and DataConnectionActor._subscriptionIds. Removing it eliminates the
race and removes dead state in one stroke. Add a regression test (or extend
DCL003_SharedDictionaryFields_AreConcurrentCollections) that asserts no
non-concurrent Dictionary field is shared across thread boundaries in adapter
state.
DataConnectionLayer-020 — HandleSubscribeCompleted double-counts _totalSubscribed when a previously-unresolved tag is resolved by a different instance's subscribe
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:653-661,670-688 |
Resolution — split the success branch into "fresh subscribe" vs
"unresolved → resolved promotion": _unresolvedTags.Remove(...) is now
called before incrementing the counters; a promotion increments only
_resolvedTags and clears _resolutionInFlight, mirroring
HandleTagResolutionSucceeded. The symmetric failure branch was also
fixed — _totalSubscribed only increments when
_unresolvedTags.Add(...) returns true so a second instance failing
to resolve the same tag is a no-op on the counter. Tests:
DCL020_UnresolvedTagPromoted_ByDifferentInstance_DoesNotDoubleCountTotalSubscribed
and DCL020_TwoInstancesFailingSameTag_OnlyCountsTagOnceInTotal.
Description
HandleSubscribeCompleted's success branch (line 656-661) writes
_subscriptionIds[result.TagPath] = result.SubscriptionId!; _totalSubscribed++; _resolvedTags++;. The guard at line 653 only skips when the tag is already in
_subscriptionIds; it does not check _unresolvedTags. So the success branch
runs for a tag that previously failed resolution from an earlier instance's subscribe
(which incremented _totalSubscribed and added the tag to _unresolvedTags at line
674-676) and is now successfully subscribed by a later instance.
Sequence:
- Instance A subscribes "Tag1".
_adapter.SubscribeAsyncthrows a non-connection-level exception.HandleSubscribeCompletedtakes the resolution-failure branch:_unresolvedTags.Add("Tag1"); _totalSubscribed++;(now 1). - The device finishes booting. Instance B subscribes "Tag1".
_adapter.SubscribeAsyncsucceeds, returningsubId.HandleSubscribeCompletedtakes the success branch:_subscriptionIds["Tag1"] = subId; _totalSubscribed++; _resolvedTags++;(now_totalSubscribed = 2,_resolvedTags = 1). _unresolvedTagsstill contains "Tag1". The retry timer fires next tick,HandleRetryTagResolutiondispatchesSubscribeAsync("Tag1", ...)against the adapter (creating a second monitored item for the same tag), andHandleTagResolutionSucceededruns_unresolvedTags.Remove("Tag1")→_subscriptionIds["Tag1"] = newSubId(overwriting Instance B's id, orphaning that monitored item) →_resolvedTags++(now 2, matching_totalSubscribed).
Net effect:
_totalSubscribedis over-counted by 1 from step 2 until step 3 reconciles_resolvedTags. During that window the health report's "subscribed / resolved" ratio is wrong.- Two adapter subscription handles for the same tag are leaked across this race
(DCL-018's orphan plus the retry's second adapter call); the second leaks
permanently because
_subscriptionIds["Tag1"]only stores the most recent id.
Recommendation
In HandleSubscribeCompleted's success branch, before the _totalSubscribed++,
check _unresolvedTags.Remove(result.TagPath) — if the tag was already counted as
unresolved, promote it without re-incrementing _totalSubscribed (mirror
HandleTagResolutionSucceeded's shape: only increment _resolvedTags,
_subscriptionIds[tag] = subId, and clear _resolutionInFlight). Add a regression
test that asserts _totalSubscribed / _resolvedTags consistency after the
"resolve via a second instance" sequence.
DataConnectionLayer-021 — HandleSubscribeCompleted re-creates and leaks _subscriptionsByInstance entry when the instance unsubscribed mid-flight
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:626-634,642-687 |
Resolution — HandleSubscribeCompleted now detects the
mid-termination race: when _subscriptionsByInstance.TryGetValue fails,
the method logs a warning, clears each owned tag from
_subscribesInFlight, fires fire-and-forget
_adapter.UnsubscribeAsync(result.SubscriptionId!) for every successful
non-AlreadySubscribed result (so the OPC UA monitored items don't keep
streaming for a tag nobody listens to), and returns without applying
counter or handle mutations. Regression test
DCL021_UnsubscribeDuringInFlightSubscribe_ReleasesAdapterHandle_AndKeepsStateClean
parks the subscribe, sends UnsubscribeTagsRequest, then releases the
subscribe and asserts _adapter.UnsubscribeAsync is called and
TotalSubscribedTags == 0.
Description
HandleSubscribe dispatches a Task.Run that performs adapter I/O off the actor
thread and pipes a SubscribeCompleted back. If an UnsubscribeTagsRequest for the
same instance is processed on the actor thread between dispatch and completion,
HandleUnsubscribe removes the instance from both _subscriptionsByInstance and
_subscribers. When the late SubscribeCompleted arrives,
HandleSubscribeCompleted (line 629-634) re-creates the
_subscriptionsByInstance[instanceName] = new HashSet<string>() entry and proceeds
to apply the results — but _subscribers[instanceName] was already removed by the
unsubscribe and is not re-added.
Consequences:
_subscriptionsByInstancekeeps a permanently-leaked entry for an instance that no longer exists.ReSubscribeAllderives its tag list from_subscriptionsByInstance.Valuesand will keep re-subscribing the leaked tags on every future reconnect.- For each tag,
_tagSubscriberCount[tagPath]is incremented (line 647-649), so the reverse index treats the leaked instance as a real subscriber. The only way to drop the count is anotherHandleUnsubscribefor the same instance — which can never arrive because the Instance Actor that owned the instance is gone. - The success branch increments
_totalSubscribed/_resolvedTags(or_unresolvedTagsfor genuine resolution failures), drifting health counters permanently above the actual subscribed instance count. - Subsequent
HandleTagValueReceivedfanout iterates_subscriptionsByInstanceand skips this entry via the_subscribers.TryGetValuecheck (line 1019), so values are silently dropped — but the work of fanning them out (the iteration and the tag lookup) is still done for every value update on every leaked tag, forever. - The genuine-resolution-failure path at line 682-686 (
subscriber.Tell(new TagValueUpdate(..., QualityCode.Bad, ...))) also silently no-ops because_subscribers.TryGetValueis false — so the design doc's "push bad quality on resolution failure" promise is broken for this case (a minor, edge-case wrinkle).
Recommendation
In HandleSubscribeCompleted, when _subscriptionsByInstance.TryGetValue fails,
treat the result as obsolete: log it and return without re-creating the entry or
applying any state mutations. Any successfully-created adapter subscriptions in
msg.Results should be cleaned up — iterate the results and
_adapter.UnsubscribeAsync(result.SubscriptionId!) (fire-and-forget) for each
successful one so the orphan handles do not leak in the adapter. Add a regression
test that subscribes from instance A, immediately sends an UnsubscribeTagsRequest
for A while the subscribe I/O is in flight, completes the subscribe, and asserts
_subscriptionsByInstance, _tagSubscriberCount and health counters are all clean.
DataConnectionLayer-022 — HandleSubscribeCompleted and HandleTagResolutionFailed reset the tag-resolution retry timer on every call via StartPeriodicTimer, starving the retry under subscribe bursts
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:691-698,991-998 |
Resolution — both call sites now gate
Timers.StartPeriodicTimer("tag-resolution-retry", ...) with
!Timers.IsTimerActive("tag-resolution-retry") so the first failure
arms the timer and subsequent failures only pile onto _unresolvedTags.
Regression test
DCL022_BurstedFailedSubscribes_DoNotResetRetryTimer fires 5 failed
subscribes within one retry interval and asserts the retry timer
actually fires within one interval of the first failure (pre-fix it
would have been pushed past the interval boundary by the resets).
Description
HandleSubscribeCompleted (line 691-698) and HandleTagResolutionFailed (line
991-998) both call:
Timers.StartPeriodicTimer(
"tag-resolution-retry",
new RetryTagResolution(),
_options.TagResolutionRetryInterval,
_options.TagResolutionRetryInterval);
Akka.Actor.ITimerScheduler.StartPeriodicTimer(key, ...) cancels and replaces any
existing timer registered under the same key. So every additional subscribe (or
every additional tag-resolution failure) that produces unresolved tags resets the
retry timer's countdown to the full interval — the timer never accumulates
elapsed time across calls.
With the default TagResolutionRetryInterval = 10s, an instance-startup burst that
produces a new SubscribeTagsRequest every 5s (a not-unusual cadence during
deployment fan-out) will keep cancelling the not-yet-fired retry every 5s, so the
"periodic" retry never actually fires until subscribes go quiet. In a steady-state
site with many instances deploying together this can delay tag resolution by tens
of seconds, leaving attributes at Bad quality longer than the documented retry
interval implies.
Recommendation
Start the periodic timer once, when the actor first transitions to having
non-empty _unresolvedTags, and only re-start it after Timers.Cancel(...) has
been called (e.g., when the actor enters Reconnecting). The cleanest pattern is to
gate the start with if (!Timers.IsTimerActive("tag-resolution-retry")) before
calling StartPeriodicTimer — IsTimerActive is on ITimerScheduler. Apply the
same gate at both call sites. Add a regression test that fires 5 subscribes with
unresolved tags within one retry interval and asserts the retry fires at most one
interval after the first failure (not after the fifth subscribe).