ScadaBridge/code-reviews/DataConnectionLayer/findings.md

# Code Review — DataConnectionLayer

| Field | Value |
|-------|-------|
| Module | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer` |
| Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
| Status | Reviewed |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `4307c381` |
| Open findings | 0 |

## Summary

The DataConnectionLayer is a reasonably well-structured module: the Become/Stash
lifecycle state machine, the captured-`Self` marshalling of background-thread
disconnect events, and the protocol-factory abstraction all follow the design doc
and Akka.NET conventions. However, the review found one **critical** actor-model
violation — `HandleSubscribe` spawns a `Task.Run` that mutates the actor's private
dictionaries and counters from a thread-pool thread, racing with the actor's own
message loop. Several **high**-severity issues cluster around concurrency and error
handling: the subscription-failure path leaves the connection with degraded subtrees
but no real recovery, the `DataConnectionManagerActor`'s `Restart` supervision drops
all subscription state on a connection-actor crash, and `RealOpcUaClient`'s monitored-
item callback dictionary is mutated without synchronization while OPC UA notification
threads read it. The remaining findings concern stale health counters after failover,
an unused `WriteTimeout` option (writes are unbounded despite the design promising a
30 s timeout), `ReadBatchAsync` aborting mid-batch, and documentation drift between
the design doc's failover state machine and the implemented unstable-disconnect
heuristic. Test coverage is adequate for the happy paths and failover but absent for
tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.

#### Re-review 2026-05-28 (commit `1eb6e97`)

The 2026-05-28 re-review walked all 10 checklist categories against the current
source and found **5 new findings**. All 17 prior findings remain `Resolved` and the
fixes (reverse-index unsubscribe, atomic disconnect guards, real-logger threading,
initial-connect failover, per-tag write-batch results, subscribe-response accuracy)
were verified in place. The new findings cluster around `HandleSubscribe` /
`HandleSubscribeCompleted` race-induced state drift:

- **High** — concurrent subscribes for the same tag from different instances each see
  the tag as not-yet-subscribed (the `alreadySubscribed` snapshot was taken before
  the Task.Run dispatch), so each Task.Run calls `_adapter.SubscribeAsync` and the
  later `HandleSubscribeCompleted` silently discards the second adapter subscription
  handle — the orphan never gets `UnsubscribeAsync`'d.
- **Medium** — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>`
  mutated from thread-pool continuations of `SubscribeAsync` / `UnsubscribeAsync` /
  `DisconnectAsync` running in parallel — the same class of bug DCL-003 fixed in
  `RealOpcUaClient` but missed in the layer above.
- **Medium** — `HandleSubscribeCompleted`'s success branch never checks
  `_unresolvedTags`, so a tag that previously failed resolution (incrementing
  `_totalSubscribed`) and is then successfully subscribed by a different instance gets
  `_totalSubscribed++` a second time, double-counting; meanwhile the unresolved entry
  lingers until the retry timer also resolves it, creating an orphaned monitored item.
- **Medium** — when an instance is unsubscribed mid-flight,
  `HandleSubscribeCompleted` re-creates an empty `_subscriptionsByInstance[name]`
  entry and processes the late results, leaking `_tagSubscriberCount` /
  `_totalSubscribed` / `_resolvedTags` increments for an instance with no
  `_subscribers` entry to deliver values to.
- **Medium** — `HandleSubscribeCompleted` calls `Timers.StartPeriodicTimer` on every
  completed subscribe with unresolved tags; in Akka.NET, `StartPeriodicTimer` with the
  same key cancels and replaces the existing timer, so a burst of subscribes arriving
  faster than `TagResolutionRetryInterval` (10 s default) keeps resetting the timer
  and the retry never actually fires.

#### Re-review 2026-05-17 (commit `39d737e`)

All 13 findings from the 2026-05-16 review remain `Resolved` and the fixes were
verified in place against the current source (`PipeTo(Self)` subscribe pattern,
`Resume` supervision, `ConcurrentDictionary` callback maps, atomic disconnect guards,
bounded write timeout, etc.). The re-review walked all 10 checklist categories again
and found **4 new findings**: one **High** — the DCL-012 security warning is never
seen in production because `RealOpcUaClientFactory.Create()` constructs
`RealOpcUaClient` with no logger, so the warning sinks into `NullLogger`; one
**Medium** — initial-connect failures in the `Connecting` state never count toward
failover, so a connection whose primary endpoint is unreachable at startup retries the
primary forever and never tries the configured backup; one **Medium** —
`HandleSubscribeCompleted` always replies `SubscribeTagsResponse(success: true)` even
when a connection-level subscribe failure is driving the actor into `Reconnecting`,
telling the Instance Actor the subscribe succeeded when it did not; and one **Low** —
`WriteBatchAsync` does not catch the `InvalidOperationException` from `EnsureConnected`,
so a mid-batch disconnect aborts the whole write batch (the same class of defect
DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
`DataConnectionLayer-014`.

## Checklist coverage (2026-05-28 re-review, commit `1eb6e97`)

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | Findings 020 (double-count `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance) and 021 (leaked `_subscriptionsByInstance` entry + counter increments when instance unsubscribes mid-flight). |
| 2 | Akka.NET conventions | x | Finding 022 — `Timers.StartPeriodicTimer` reset on every `HandleSubscribeCompleted` for unresolved tags can stall the retry timer indefinitely under a subscribe burst. |
| 3 | Concurrency & thread safety | x | Finding 018 — concurrent subscribes for the same tag from different instances each spawn an adapter subscription and the second handle is orphaned. Finding 019 — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from thread-pool continuations (same class of bug as DCL-003 one layer above). |
| 4 | Error handling & resilience | x | No new issues; DCL-004 / DCL-007 / DCL-015 / DCL-017 fixes verified in place. |
| 5 | Security | x | No new issues; DCL-012 / DCL-014 fixes verified. The Commons-side `OpcUaEndpointConfig.AutoAcceptUntrustedCerts = true` default surfaced in DCL-012 is still present but is out of this module's scope. |
| 6 | Performance & resource management | x | No new issues; DCL-008 reverse index verified. (Finding 018's orphaned adapter handle is logged under concurrency.) |
| 7 | Design-document adherence | x | No new issues. DCL-009's design-doc action (document unstable-disconnect failover trigger + configurable threshold) is still open at the doc level but out of this module's scope. |
| 8 | Code organization & conventions | x | No issues — POCOs in Commons, options class owned by component, factory + DI registration consistent. |
| 9 | Testing coverage | x | DCL001–017 regression tests present. Gaps remain for finding 018 (concurrent subscribe of same tag from two instances), 019 (concurrent `_subscriptionHandles` mutation), 020 (resolve-via-different-instance), 021 (unsubscribe-mid-flight), 022 (timer-reset starvation). |
| 10 | Documentation & comments | x | No new issues; DCL-013 atomic-guard XML comments verified. |

## Checklist coverage (2026-05-17 re-review, commit `39d737e`)

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | 2026-05-16 findings resolved. Re-review: finding 016 — `SubscribeTagsResponse` reports success on a connection-level subscribe failure. |
| 2 | Akka.NET conventions | x | 2026-05-16 findings resolved (`PipeTo(Self)` subscribe, `Resume` supervision). Re-review: no new issues. |
| 3 | Concurrency & thread safety | x | 2026-05-16 findings resolved (`ConcurrentDictionary`, atomic disconnect guards). Re-review: no new issues. |
| 4 | Error handling & resilience | x | Re-review: finding 015 — initial-connect failures never trigger failover; finding 017 — `WriteBatchAsync` aborts on mid-batch disconnect. |
| 5 | Security | x | Re-review: finding 014 — the DCL-012 auto-accept-cert warning is never logged in production (`RealOpcUaClient` built without a logger). |
| 6 | Performance & resource management | x | 2026-05-16 finding 008 resolved (reverse index). Re-review: no new issues. |
| 7 | Design-document adherence | x | 2026-05-16 findings 005/009 resolved. Re-review: no new issues (finding 015 logged under resilience). |
| 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. |
| 9 | Testing coverage | x | DCL001–013 regression tests present. Re-review: gaps remain for finding 014/015/016 scenarios (no test for production logger wiring, startup failover, or subscribe-response-on-failure). |
| 10 | Documentation & comments | x | 2026-05-16 finding 013 resolved. Re-review: no new issues. |

## Findings

#### Re-review 2026-06-20 (commit `4307c381`) — full review

The 2026-06-20 full re-review walked all 10 checklist categories against the current
source, focused on the M7 native-alarm subscribe path, the OPC UA node-browser
(BrowseNext/search/type-info), and the verify-endpoint + site-local cert-trust surface.
The M7 surface is well-built overall — the verify-endpoint probe correctly *captures
but never trusts* an untrusted server certificate, the per-node `CertStoreActor`
broadcast keeps both site nodes' PKI stores consistent across failover, and the
node-browser paging/search/type-info additions are clean. All 22 prior findings remain
`Resolved` and their fixes were verified in place. The review found **4 new findings**,
all clustering on the native-alarm subscribe path, which did **not** inherit the
guards the tag-subscribe path earned across DCL-018 / DCL-021 / DCL-022: the alarm
path leaks its adapter feed on a mid-flight unsubscribe (no DCL-021-style obsolete-
completion guard), has no alarm-resolution retry to match the tag path's
`tag-resolution-retry` timer, leaks `_alarmCts` on `DisposeAsync` (only
`DisconnectAsync` tears it down), and carries a stale "first subscriber wins" comment
on a last-wins filter assignment.

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | No new issues; DCL-016/020/021 counter/response fixes verified. Alarm-path snapshot atomic-swap + per-source prefix routing are correct. |
| 2 | Akka.NET conventions | x | No new issues; DCL-022 `IsTimerActive` gate verified. Alarm subscribe uses `ContinueWith(...).PipeTo(Self)` with `Self`/generation captured (no `Sender`/`this` in closures). |
| 3 | Concurrency & thread safety | x | Finding 023 — native-alarm mid-flight unsubscribe leaks the adapter alarm feed (the tag path's DCL-021 obsolete-completion guard was not mirrored on the alarm path). |
| 4 | Error handling & resilience | x | Finding 024 — no alarm-resolution retry: a failed initial alarm subscribe strands the subscriber with no feed until the next full reconnect (the tag path retries on `tag-resolution-retry`). |
| 5 | Security | x | No new issues. The M7 verify-endpoint probe builds a temporary `RealOpcUaClient`, captures the untrusted server cert, and NEVER trusts it; cert trust is gated through `CertStoreActor` broadcast to both nodes. DCL-012/014 auto-accept warning + Commons default remain out-of-scope follow-ups. |
| 6 | Performance & resource management | x | Finding 025 — `MxGatewayDataConnection.DisposeAsync` abandons `_alarmCts` (alarm task + CTS leak on failover/stop); `DisconnectAsync` already cancels+disposes it under `_alarmLock`. |
| 7 | Design-document adherence | x | No new issues; M7 native-alarm read-only mirror, `AlarmKind` discriminator, and per-source `IAlarmSubscribableConnection` feed match the design. DCL-009 doc action still open at doc level (out of scope). |
| 8 | Code organization & conventions | x | No issues — alarm messages/POCOs in Commons, `IAlarmSubscribableConnection` capability seam in the module, options class owned by component. |
| 9 | Testing coverage | x | DCL001–022 regression tests present. Gaps for findings 023 (alarm unsubscribe mid-flight), 024 (failed alarm subscribe → no retry), 025 (`DisposeAsync` alarm-CTS leak). |
| 10 | Documentation & comments | x | Finding 028 — the `_alarmSourceFilter` "first subscriber wins" XML comment contradicts the last-wins assignment in `HandleSubscribeAlarms`. |

### DataConnectionLayer-001 — `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread

| | |
|--|--|
| Severity | Critical |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:473-538` |

**Description**

`HandleSubscribe` launches a `Task.Run(async () => ...)` that runs on a thread-pool
thread and directly mutates the actor's private mutable state: `instanceTags` (a
reference into `_subscriptionsByInstance`), `_subscriptionIds`, `_totalSubscribed`,
`_resolvedTags`, and `_unresolvedTags`. All of these are simultaneously read and
written by the actor's own message loop (`HandleTagValueReceived`, `HandleUnsubscribe`,
`ReSubscribeAll`, `HandleRetryTagResolution`, `ReplyWithHealthReport`). This is a
direct violation of the Akka.NET actor model, which guarantees single-threaded access
to actor state only when state is touched on the actor thread. Two concurrent
subscribe requests, or a subscribe overlapping a `TagValueReceived` / `GetHealthReport`,
produce data races on `Dictionary`/`HashSet`/`int` — `Dictionary` is not thread-safe
and concurrent mutation can corrupt internal buckets, throw, or lose entries. It can
also produce torn reads of the health counters.

**Recommendation**

Do not mutate actor state from the background task. Perform only the `await
_adapter.SubscribeAsync(...)` / `ReadAsync(...)` I/O in the task, collect the results
into a local immutable result object, and `PipeTo(Self)` an internal message (e.g.
`SubscribeCompleted`) whose handler — running on the actor thread — applies all state
mutations and counter updates. The response to `Sender` should be sent from that
handler too.

**Resolution**

Resolved 2026-05-16. `HandleSubscribe` was restructured to follow the actor's own
`PipeTo(Self)` pattern (the one already used by `HandleRetryTagResolution`): the
background `Task.Run` now performs only adapter I/O (`SubscribeAsync`/`ReadAsync`),
collects per-tag outcomes into an immutable `SubscribeCompleted` message, and pipes
that to `Self`. All mutation of `_subscriptionIds`, `_subscriptionsByInstance`,
`_totalSubscribed`, `_resolvedTags` and `_unresolvedTags` now happens in the new
`HandleSubscribeCompleted` handler on the actor thread; it is wired into the
Connected, Connecting and Reconnecting states so an in-flight subscribe is applied
regardless of state transitions. Regression test
`DCL001_ConcurrentSubscribes_DoNotCorruptSubscriptionCounters` (30×30 concurrent
subscribes) fails against the pre-fix code and passes after. Fixed by the commit
whose message references `DataConnectionLayer-001`.

### DataConnectionLayer-002 — `Restart` supervision discards all subscription state on connection-actor crash

| | |
|--|--|
| Severity | High |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionManagerActor.cs:131-141` |

**Description**

`DataConnectionManagerActor.SupervisorStrategy` returns a `OneForOneStrategy` with
`Directive.Restart` for `DataConnectionActor` failures. On restart, Akka.NET creates a
fresh actor instance, so all in-memory fields — `_subscriptionsByInstance`,
`_subscriptionIds`, `_subscribers`, `_unresolvedTags`, the quality counters — are
silently discarded. The actor re-enters `Connecting` with zero subscriptions, and the
design doc's "transparent re-subscribe" guarantee (WP-10) is broken: Instance Actors
that had subscribed before the crash never get their tags re-subscribed and will sit
at uncertain/stale quality indefinitely with no error returned. There is no durable
subscription store from which a restarted actor could rebuild state.

**Recommendation**

Either (a) make the subscription registry durable/recoverable so a restarted actor
can rebuild it (persist to local SQLite as the design doc says connection definitions
are, and have `PreStart` reload subscriptions), or (b) treat a connection-actor crash
as a lifecycle event the `DataConnectionManagerActor` notices, so it can re-issue the
subscription registrations. At minimum document that subscribers must re-register
after a crash and surface the lost-state condition rather than failing silently.

**Resolution**

Resolved 2026-05-16. The `DataConnectionManagerActor.SupervisorStrategy` was changed
from `Directive.Restart` to `Directive.Resume` for `DataConnectionActor` failures.
`Resume` keeps the existing actor instance and all its in-memory subscription state
(`_subscriptionsByInstance`, `_subscriptionIds`, `_subscribers`, quality counters)
intact across a transient handler exception, so the design doc's "transparent
re-subscribe" guarantee (WP-10) is preserved. The actor is a long-lived stateful
coordinator and its own Become/Stash reconnect state machine already recovers
connection-level faults — it does not need a restart. This also aligns with the
ScadaBridge convention of `Resume` for coordinator actors. Regression test
`DCL002_ConnectionActorCrash_PreservesSubscriptionState` crashes the connection actor
via a synchronously-throwing write and asserts the subscription survives (health
report still shows 1 subscribed/resolved tag); it fails against the pre-fix `Restart`
code and passes after. Fixed by the commit whose message references
`DataConnectionLayer-002` (commit `<pending>`).

### DataConnectionLayer-003 — `RealOpcUaClient` callback/monitored-item dictionaries mutated without synchronization

| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/RealOpcUaClient.cs:16-17,130-131,153,163,173,183-184` |

**Description**

`_monitoredItems` and `_callbacks` are plain `Dictionary<,>` instances. They are
written from `CreateSubscriptionAsync` / `RemoveSubscriptionAsync` (invoked from the
`DataConnectionActor`'s `Task.Run` / `ContinueWith` continuations, i.e. thread-pool
threads) and from `DisconnectAsync` (`.Clear()`), while being read concurrently from
the OPC Foundation SDK's `MonitoredItem.Notification` event handler, which fires on
the SDK's internal publish threads (`_callbacks.TryGetValue(handle, ...)` at line
163). Concurrent reads during a `Dictionary` resize or `Clear()` are undefined
behaviour — they can throw `InvalidOperationException`, return wrong entries, or
corrupt the dictionary. The `DataConnectionActor`'s subscribe path already runs off
the actor thread (finding 001), so multiple subscribe calls can also race each other
here.

**Recommendation**

Use `ConcurrentDictionary<,>` for `_monitoredItems` and `_callbacks`, or guard all
access with a lock. Note that fixing finding 001 (serialising subscribe through the
actor thread) reduces but does not eliminate the race, because the SDK notification
threads still read `_callbacks` concurrently with `RemoveSubscriptionAsync` /
`DisconnectAsync`.

**Resolution**

Resolved 2026-05-16. `_monitoredItems` and `_callbacks` in `RealOpcUaClient` were
changed from plain `Dictionary<,>` to `ConcurrentDictionary<,>`, and the two
`Remove(key)` call sites switched to `TryRemove`. This makes the maps safe to read
from the OPC Foundation SDK's publish threads (`MonitoredItem.Notification` reading
`_callbacks`) concurrently with subscribe/disconnect mutations on other threads.
`RealOpcUaClient` wraps concrete OPC Foundation SDK types (`ISession`,
`Subscription`, `MonitoredItem`) and cannot be exercised without a live OPC UA
server, so the regression is guarded structurally by
`DCL003_SharedDictionaryFields_AreConcurrentCollections` (a reflection test asserting
both fields are `ConcurrentDictionary<,>`); it fails against the pre-fix `Dictionary`
code and passes after. Fixed by the commit whose message references
`DataConnectionLayer-003` (commit `<pending>`).

### DataConnectionLayer-004 — Subscribe-time tag-resolution failure leaves the connection healthy but never recovers correctly

| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:495-503,529-537` |

**Description**

When `_adapter.SubscribeAsync` throws inside the `HandleSubscribe` background task,
the catch block adds the tag to `_unresolvedTags` and increments `_totalSubscribed`,
treating every subscribe exception as a tag-resolution failure. But `SubscribeAsync`
also throws `InvalidOperationException` from `EnsureConnected()` when the OPC UA
client is not connected, and throws on transport faults — these are connection
problems, not bad tag paths. They get misclassified as unresolved tags and retried on
the 10 s tag-resolution timer instead of triggering the reconnection state machine.
Worse, the design doc (Tag Path Resolution, step 2) says the failed tag's attribute
must be marked quality `bad`; the code never pushes a bad-quality update to the
subscriber for a tag that fails to resolve at subscribe time, so the Instance Actor
stays at uncertain quality with no signal. The `TagResolutionFailed` message it sends
to `Self` only logs and re-arms the timer (`HandleTagResolutionFailed`).

**Recommendation**

Distinguish connection-level exceptions (raise `AdapterDisconnected` / let the
reconnect machine handle them) from genuine node-not-found errors. For genuine
resolution failures, push a `TagValueUpdate` with `QualityCode.Bad` to the subscribing
Instance Actor so it reflects the documented behaviour.

**Resolution**

Resolved 2026-05-16. The subscribe background task now classifies each subscribe
exception via the new `IsConnectionLevelFailure` helper (`InvalidOperationException`
— thrown by `EnsureConnected()` — plus `SocketException`/`TimeoutException`/
`IOException` count as connection-level; anything else is a genuine resolution
failure). The classification is carried on `SubscribeTagResult.ConnectionLevelFailure`
and applied on the actor thread in `HandleSubscribeCompleted`: connection-level
failures no longer become unresolved tags and instead drive the reconnection state
machine (`HandleSubscribeCompleted` returns a flag and the Connected-state handler
calls `BecomeReconnecting`); genuine resolution failures still go to `_unresolvedTags`
and the retry timer, and now also push a `TagValueUpdate` with `QualityCode.Bad` to
the subscribing Instance Actor, matching the design doc's Tag Path Resolution step 2.
Regression tests `DCL004_GenuineTagResolutionFailure_PushesBadQualityToSubscriber`
and `DCL004_ConnectionLevelSubscribeFailure_TriggersReconnect_NotTagRetry` both fail
against the pre-fix code and pass after. Fixed by the commit whose message references
`DataConnectionLayer-004` (commit `<pending>`).

### DataConnectionLayer-005 — `WriteTimeout` option is documented and configured but never applied

| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/DataConnectionOptions.cs:15`, `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:573-590` |

**Description**

`DataConnectionOptions.WriteTimeout` (default 30 s) and the design doc's "Shared
Settings" table both promise a bounded timeout for synchronous device writes. The
value is never read anywhere in the module (`grep` confirms only the declaration).
`HandleWrite` calls `_adapter.WriteAsync(request.TagPath, request.Value)` with no
`CancellationToken` and no timeout. If the OPC UA server hangs (TCP black-hole, no
RST), the write `Task` never completes, `PipeTo(sender)` never fires, and the calling
script's Ask blocks until its own ask-timeout — and the script gets no DCL-level
error. The design states write failures (including timeout) must be returned
synchronously to the script; an unbounded write violates that.

**Recommendation**

Create a `CancellationTokenSource(_options.WriteTimeout)`, pass its token to
`WriteAsync`, and in the continuation translate cancellation into a failed
`WriteTagResponse` with a timeout error message. Apply the same to the read used by
the initial-value seed and to `WriteBatchAndWaitAsync` paths if they are reachable.

**Resolution**

Resolved 2026-05-16. `HandleWrite` now creates a `CancellationTokenSource(_options.WriteTimeout)`,
passes its token to `_adapter.WriteAsync(...)`, and disposes the source in the
continuation. A cancelled/timed-out write (`Task.IsCanceled` or a base
`OperationCanceledException`) is translated into a failed `WriteTagResponse` with a
`"Write timeout after Ns"` message, so a hung device write is bounded and the failure
is returned synchronously to the calling script (WP-11) instead of blocking until the
script's own Ask-timeout. (The `WriteBatchAndWaitAsync` adapter path already accepts
an explicit `timeout`/`CancellationToken` and is not invoked by `HandleWrite`, so no
change was needed there.) Regression test
`DCL005_Write_ThatHangs_TimesOutAndReturnsFailureSynchronously` uses an adapter whose
`WriteAsync` only completes when its token fires; it fails against the pre-fix
unbounded code and passes after. Fixed by the commit whose message references
`DataConnectionLayer-005` (commit `<pending>`).

### DataConnectionLayer-006 — Health quality counters not reset/recomputed after failover or re-subscribe

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:645-673,721-756` |

**Description**

`ReSubscribeAll` resets `_subscriptionIds`, `_unresolvedTags` and `_resolvedTags` to a
clean slate, but leaves `_lastTagQuality`, `_tagsGoodQuality`, `_tagsBadQuality` and
`_tagsUncertainQuality` untouched. `PushBadQualityForAllTags` (called on disconnect)
sets `_tagsBadQuality = _lastTagQuality.Count` and zeroes the others. After a
reconnect, `HandleTagValueReceived` decrements the *old* bucket using
`_lastTagQuality`'s value and increments the new one — but tags resolved for the first
time after reconnect were never in `_lastTagQuality`, so they only increment, never
decrement, and the totals can drift above `_totalSubscribed`. Over repeated
disconnect/reconnect cycles the health report's good/bad/uncertain counts become
unreliable.

**Verification note**: Confirmed against source. The root cause is broader than the
reconnect path the finding describes: `HandleUnsubscribe` also never removes a tag
from `_lastTagQuality` nor decrements its quality bucket, so an unsubscribed tag
lingers and `PushBadQualityForAllTags` (which sets `_tagsBadQuality =
_lastTagQuality.Count`) over-counts it — driving the bad-quality count above
`_totalSubscribed` even without a re-subscribe. Both the unsubscribe leak and the
re-subscribe drift are real.

**Recommendation**

On `BecomeConnected` after a re-subscribe (or in `ReSubscribeAll`), clear
`_lastTagQuality` and the three quality counters and let them be repopulated from
fresh `TagValueReceived` messages. Alternatively recompute the buckets from
`_lastTagQuality` whenever it changes rather than maintaining incremental counters.

**Resolution**

Resolved 2026-05-16 (commit pending). `HandleUnsubscribe` now removes each
unsubscribed tag from `_lastTagQuality` and decrements the corresponding quality
bucket, then reports the corrected counters via `UpdateTagQuality`/`UpdateTagResolution`;
`ReSubscribeAll` clears `_lastTagQuality` and zeroes the three quality counters so
post-reconnect tags are repopulated from fresh `TagValueReceived` messages instead of
only incrementing. Regression test
`DCL006_DisconnectAfterUnsubscribe_BadQualityCountMatchesRemainingTags` subscribes two
tags, pushes Good values, unsubscribes one, then disconnects and asserts
`PushBadQualityForAllTags` reports exactly 1 bad tag (the reconnect is gated open so
`ReSubscribeAll` does not run before the assertion); it reports 2 against the pre-fix
code and 1 after.

### DataConnectionLayer-007 — `ReadBatchAsync` aborts the whole batch on the first failing tag

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:187-195` |

**Description**

`ReadBatchAsync` loops calling `ReadAsync` per tag. `ReadAsync` re-throws any
non-cancellation exception (line 184). So if any single tag in the batch throws (bad
node, transient fault), the entire `ReadBatchAsync` throws and the caller gets no
results for the tags that *did* read successfully — even though `ReadResult` already
has a `Success`/`ErrorMessage` shape designed to carry per-tag failures. The batch is
also fully serial (one round-trip per tag), defeating the point of a batch API; the
design doc lists `ReadBatch`/`WriteBatch` as first-class operations.

**Verification note**: Confirmed against source — `ReadAsync` re-throws on any
non-`OperationCanceledException`, aborting the whole batch.

**Recommendation**

Catch per-tag exceptions inside the loop and store a failed `ReadResult` for that tag
so the batch returns a complete map. Ideally issue a single OPC UA `Read` service call
for all node IDs (`RealOpcUaClient.ReadValueAsync` already builds a
`ReadValueIdCollection` — extend it to accept multiple nodes).

**Resolution**

Resolved 2026-05-16 (commit pending). `ReadBatchAsync` now wraps each per-tag
`ReadAsync` in a try/catch: a per-tag exception is recorded as a failed `ReadResult`
(`Success: false`, message = the exception message) so the batch returns a complete
result map for every requested tag; `OperationCanceledException` is still propagated
so a cancelled batch aborts as a whole. The per-tag-serial loop and single-service-call
optimisation were deliberately left for a follow-up — they are a performance concern,
not the correctness bug this finding raised. Regression test
`DCL007_ReadBatch_ReturnsPerTagResults_WhenOneTagFails` reads three tags where the
middle one throws and asserts all three appear in the result map with the failing one
marked unsuccessful; it threw (no map returned) against the pre-fix code and passes
after.

### DataConnectionLayer-008 — `HandleUnsubscribe` is O(n^2) over instances and rechecks `_unresolvedTags` redundantly

| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:540-569` |

**Description**

For each tag of the instance being removed, `HandleUnsubscribe` scans every other
instance's tag set (`_subscriptionsByInstance.Where(...).Any()`), making the operation
O(tags x instances). On a site with many instances sharing a connection this is
needlessly expensive on every instance stop/redeploy. Separately, line 562
re-evaluates `!_unresolvedTags.Contains(tagPath)` immediately after line 561 already
removed `tagPath` from `_unresolvedTags`, so the condition is always true — dead
logic that obscures intent (the decrement of `_resolvedTags` is unconditional in
practice).

**Recommendation**

Maintain a reference count per tag path (or a `tagPath -> set<instance>` reverse index)
so the "any other subscriber" check is O(1). Remove the redundant `_unresolvedTags`
re-check or restructure so the resolved/unresolved decrement reflects the tag's actual
prior state captured before removal.

**Resolution**

Resolved 2026-05-16 (commit pending). A `tagPath → subscriber-count` reverse index
(`_tagSubscriberCount`) was added: `HandleSubscribeCompleted` increments it whenever a
tag is newly added to an instance's set, and `HandleUnsubscribe` decrements it,
releasing a tag at the adapter only when the count reaches zero. The "any other
subscriber" check is now O(1) per tag instead of an O(instances) `Where(...).Any()`
scan. The redundant `!_unresolvedTags.Contains(tagPath)` re-check (always true after
the unconditional `Remove` on the line above) was removed — the surviving branch is
entered only for tags that have a subscription id, which are by definition resolved,
so `_resolvedTags--` is now unconditional with an explanatory comment. The cleanup
also fixed a latent leak the original code could not reach: an unresolved tag whose
last subscriber unsubscribes is now removed from `_unresolvedTags`/`_resolutionInFlight`
and decremented from `_totalSubscribed` (previously it lingered in the retry timer and
the subscribed total forever). Regression test
`DCL008_Unsubscribe_OnlyReleasesTagWhenLastSubscriberLeaves` subscribes a tag from two
instances plus an exclusive tag, then unsubscribes each instance and asserts the
shared tag is released at the adapter only after the last subscriber leaves and the
health counters track correctly. (This finding is a performance refactor, not a
correctness bug — the pre-fix `Where(...).Any()` logic was functionally correct, so
the test passes against both versions and serves as a behavioural guard for the
refactor.)

### DataConnectionLayer-009 — Implemented failover heuristic diverges from the documented state machine

| | |
|--|--|
| Severity | Medium — partially design-doc work outside this module's editable scope |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:189,242-297,379-449`, `docs/requirements/Component-DataConnectionLayer.md:73-85` |

**Description**

The design doc's failover state machine reads "retry active endpoint (5s) -> N failures
(>= FailoverRetryCount) -> switch to other endpoint". The code implements two *separate*
failover triggers: (a) `HandleReconnectResult` counts `_consecutiveFailures` on
connect-attempt failures (matches the doc), and (b) `BecomeReconnecting` additionally
counts `_consecutiveUnstableDisconnects` — connections that succeeded but dropped
within a hard-coded 60 s `StableConnectionThreshold` — and fails over on that count
too. The unstable-disconnect path, the 60 s threshold, and the fact that failover can
happen on *successful-but-flaky* connections are not described in the component doc at
all. A reviewer or operator reading `Component-DataConnectionLayer.md` would not
predict this behaviour, and the 60 s threshold is a magic constant not exposed via
`DataConnectionOptions`.

**Verification note**: Confirmed against source. The hard-coded
`StableConnectionThreshold = TimeSpan.FromSeconds(60)` `static readonly` field and the
`_consecutiveUnstableDisconnects` failover path both exist as described.

**Recommendation**

Update `Component-DataConnectionLayer.md` to document the unstable-disconnect failover
path and the stability threshold, and move the 60 s threshold into
`DataConnectionOptions` so it is configurable and consistent with the other tunables.

**Resolution**

Resolved 2026-05-16 (commit pending). The configurability half of the recommendation
is done: the hard-coded `StableConnectionThreshold` constant was removed from
`DataConnectionActor` and replaced with a new `DataConnectionOptions.StableConnectionThreshold`
property (60 s default), bindable from the `DataConnectionLayer` `appsettings.json`
section like `ReconnectInterval`/`TagResolutionRetryInterval`/`WriteTimeout`. Regression
test `DCL009_StableConnectionThreshold_IsConfigurable_WithSixtySecondDefault` guards
the default and the setter. **The documentation half is out of this module's editable
scope** — `docs/requirements/Component-DataConnectionLayer.md` (lines 73-85) still
describes only the connect-failure failover path and does not mention the
unstable-disconnect trigger. **Action required (surfaced):** the DCL design doc should
be updated to document the unstable-disconnect failover path and the configurable
stability threshold; that edit was deliberately not made here because this task is
scoped to `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer`, tests, and this findings file only.

### DataConnectionLayer-010 — Tag-resolution retry can issue duplicate concurrent subscribe attempts

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:594-619,689-703` |

**Description**

`HandleRetryTagResolution` fires `SubscribeAsync` for every tag in `_unresolvedTags`
via `ContinueWith(...).PipeTo(self)`, but does **not** remove the tags from
`_unresolvedTags` while the attempts are in flight. Because tags are not removed
before the retry, a slow `SubscribeAsync` overlapping the next 10 s tick issues
duplicate concurrent subscribe attempts for the same tag, which can create duplicate
monitored items / leaked subscription IDs (the second success overwrites
`_subscriptionIds[tag]` in `HandleTagResolutionSucceeded`, orphaning the first handle
with no `UnsubscribeAsync` call). The timer-cancel condition in
`HandleTagResolutionSucceeded` is also non-deterministic for the same reason.

**Verification note**: Confirmed against source — `HandleRetryTagResolution` dispatched
`SubscribeAsync` for every tag in `_unresolvedTags` on every tick with no in-flight
guard.

**Recommendation**

Remove tags from `_unresolvedTags` (into an "in-flight" set) when a retry is
dispatched, and only put them back on failure. This prevents overlapping duplicate
subscribe attempts and makes the timer-cancel condition deterministic.

**Resolution**

Resolved 2026-05-16 (commit pending). A new `_resolutionInFlight` `HashSet<string>`
tracks tags whose retry `SubscribeAsync` is currently outstanding.
`HandleRetryTagResolution` now dispatches only for unresolved tags **not** already in
flight (and skips entirely if all are in flight), adding each dispatched tag to the
set; `HandleTagResolutionSucceeded` and `HandleTagResolutionFailed` remove the tag
from the set when its attempt completes, and `HandleUnsubscribe`/`ReSubscribeAll`
clear stale entries. This prevents overlapping duplicate subscribe attempts and the
resulting orphaned monitored items. Regression test
`DCL010_TagResolutionRetry_DoesNotIssueDuplicateConcurrentSubscribes` gives a tag a
genuine initial failure then a retry `SubscribeAsync` that never completes, lets six
100 ms retry ticks elapse, and asserts exactly one retry was dispatched (2 total
subscribe calls); the pre-fix code dispatched on every tick (6 total).

### DataConnectionLayer-011 — Stale subscription callbacks from disposed adapters can still reach the actor

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:486-489,278-285,416-425`, `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:252-262` |

**Description**

On failover the actor disposes the old adapter (`_adapter.DisposeAsync()`,
fire-and-forget) and creates a fresh one. The old adapter's subscription callbacks
captured `self` and `tagPath` and `Tell` `TagValueReceived` to the actor. While the
`Reconnecting` handler ignores `TagValueReceived` (line 334), once the actor reaches
`Connected` again it processes them — and a disposed adapter whose OPC UA SDK threads
have not yet fully torn down could still deliver a value, mixing pre-failover device
data with the new endpoint's data and briefly reporting a value the active endpoint
never produced. There is no per-adapter generation/epoch tag on `TagValueReceived` to
distinguish current from stale callbacks.

**Verification note**: Confirmed against source — `TagValueReceived` carried no
adapter identity, and `HandleTagValueReceived` (reachable in `Connected`) processed
any such message regardless of which adapter produced it.

**Recommendation**

Add an adapter-generation counter incremented on every adapter swap; stamp it onto
`TagValueReceived` (captured in the callback closure) and drop messages whose
generation does not match the current adapter in `HandleTagValueReceived`.

**Resolution**

Resolved 2026-05-16 (commit pending). Implemented exactly as recommended: a new
`_adapterGeneration` `int` field is incremented at both adapter-swap sites (the
unstable-disconnect failover in `BecomeReconnecting` and the connect-failure failover
in `HandleReconnectResult`). The `TagValueReceived` record gained an
`AdapterGeneration` field; every subscription callback closure (`HandleSubscribe`, the
initial-read seed, `HandleRetryTagResolution`, `ReSubscribeAll`) captures the
generation in effect at subscribe time and stamps it onto each `TagValueReceived`.
`HandleTagValueReceived` drops any message whose generation no longer matches the
current adapter, so a callback fired by a disposed adapter after failover cannot reach
an Instance Actor. Regression test
`DCL011_StaleTagValueFromOldAdapter_IsNotForwardedAfterFailover` subscribes on the
primary, fails over to the backup, then invokes the captured primary callback with a
stale value and asserts the subscriber receives nothing; the stale value reached the
subscriber against the pre-fix code and is dropped after.

### DataConnectionLayer-012 — `AutoAcceptUntrustedCerts` defaults to `true`, accepting any server certificate

| | |
|--|--|
| Severity | Medium — full secure default also requires a Commons + design-doc change outside this module |
| Category | Security |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/IOpcUaClient.cs:17`, `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/RealOpcUaClient.cs:49,60-61`, `docs/requirements/Component-DataConnectionLayer.md:116` |

**Description**

`OpcUaConnectionOptions.AutoAcceptUntrustedCerts` defaults to `true`, and
`RealOpcUaClient.ConnectAsync` wires `CertificateValidator.CertificateValidation += (_, e) => e.Accept = true`
when it is set. With the default, every server certificate is accepted unconditionally
— there is no certificate-pinning or trust-store enforcement — which defeats the
`Sign`/`SignAndEncrypt` security modes against an active man-in-the-middle on the OPC
UA link. The design doc explicitly lists `true` as the default. For an industrial
control link this is a meaningful exposure; a secure-by-default posture would reject
untrusted certs unless an operator opts in per connection.

**Verification note**: Confirmed against source. Note the *authoritative* runtime
default does not actually live on `OpcUaConnectionOptions` — for a real connection
`OpcUaDataConnection.ConnectAsync` builds `OpcUaConnectionOptions` from
`OpcUaEndpointConfig` (in `ZB.MOM.WW.ScadaBridge.Commons`), whose `AutoAcceptUntrustedCerts`
property also defaults to `true`. `OpcUaConnectionOptions`' own default is only the
fallback used when an `OpcUaConnectionOptions` is constructed directly.

**Recommendation**

Default `AutoAcceptUntrustedCerts` to `false` and require explicit per-connection
opt-in, or at minimum log a prominent warning whenever the auto-accept validator is
installed. Update the design doc to reflect the secure default.

**Resolution**

Resolved 2026-05-16 (commit pending). The two in-scope parts of the recommendation
are done: (1) `OpcUaConnectionOptions.AutoAcceptUntrustedCerts` now defaults to
`false`; (2) `RealOpcUaClient.ConnectAsync` logs a prominent `ILogger` warning
whenever the auto-accept certificate validator is installed (an `ILogger<RealOpcUaClient>`
was added as an optional constructor parameter, defaulting to `NullLogger`, so
existing callers are unaffected). Regression test
`DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse` guards the
new secure default. **Two parts remain outside this module's editable scope and are
surfaced as action required:** (a) `ZB.MOM.WW.ScadaBridge.Commons.Types.DataConnections.OpcUaEndpointConfig.AutoAcceptUntrustedCerts`
still defaults to `true` — since that is the value actually used for a real connection
(see verification note above), the Commons default must also be flipped to `false`
for the system to be secure-by-default; (b) `docs/requirements/Component-DataConnectionLayer.md`
line 116 still documents `true` as the default and must be updated. Both edits were
deliberately not made here because this task is scoped to
`src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer`, tests, and this findings file only.

### DataConnectionLayer-013 — Misleading XML comment: `RaiseDisconnected` claims thread safety it does not provide

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:270-281` |

**Description**

The XML doc on `RaiseDisconnected` states "Thread-safe: only the first caller triggers
the event." The implementation is a non-atomic check-then-set on a `volatile bool`
(`if (_disconnectFired) return; _disconnectFired = true;`). `volatile` guarantees
visibility, not atomicity — two threads (e.g. the OPC UA keep-alive thread via
`OnClientConnectionLost` and a `ReadAsync` failure path) can both observe
`_disconnectFired == false` and both invoke `Disconnected`. In practice the
`DataConnectionActor` tolerates a duplicate `AdapterDisconnected` message, so impact
is low, but the comment overstates the guarantee. The same pattern exists in
`RealOpcUaClient.OnSessionKeepAlive` (`_connectionLostFired`).

**Recommendation**

Either make the guard atomic (`Interlocked.Exchange` with an `int` flag, or a lock),
or correct the comment to say "best-effort once-only; a duplicate event is possible
under a race and is tolerated downstream."

**Resolution**

Resolved 2026-05-16 (commit pending). Rather than weaken the XML comment to match the
weak guard, the guard was made genuinely atomic so the documented "only the first
caller fires the event" guarantee becomes true. `OpcUaDataConnection._disconnectFired`
and `RealOpcUaClient._connectionLostFired` were changed from `volatile bool` to `int`,
and the check-then-set in `RaiseDisconnected` / `OnSessionKeepAlive` replaced with a
single `Interlocked.Exchange(ref flag, 1) != 0` compare-and-set; the reset on connect
uses `Interlocked.Exchange(ref flag, 0)`. The XML comments on both methods were updated
to describe the atomic compare-and-set explicitly. Regression test
`DCL013_ConcurrentConnectionLost_RaisesDisconnectedExactlyOnce` runs 25 rounds, each
fanning 32 barrier-synchronised threads that raise the client's `ConnectionLost` event
simultaneously, and asserts `Disconnected` fires exactly once per round; against a
non-atomic check-then-set it double-fires (verified by temporarily reverting the
guard), and it passes against the atomic fix.

### DataConnectionLayer-014 — DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger

| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/RealOpcUaClient.cs:325`, `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/RealOpcUaClient.cs:35-39,79-83` |

**Description**

Finding DataConnectionLayer-012 was resolved in part by adding a prominent
`ILogger` warning in `RealOpcUaClient.ConnectAsync` whenever the auto-accept
certificate validator is installed (`RealOpcUaClient.cs:79-83`). The
`ILogger<RealOpcUaClient>` constructor parameter was made optional, defaulting to
`NullLogger<RealOpcUaClient>.Instance` (`RealOpcUaClient.cs:35-39`).

However, the only production code path that constructs a `RealOpcUaClient` is
`RealOpcUaClientFactory.Create()` (`RealOpcUaClient.cs:325`), which calls
`new RealOpcUaClient(_globalOptions)` and passes **no logger**. The factory itself
holds only an `OpcUaGlobalOptions` and has no `ILoggerFactory`/`ILogger` available.
As a result the `_logger` field is always `NullLogger` for every real OPC UA
connection, and the man-in-the-middle warning the DCL-012 fix added is silently
discarded. An operator who deploys a connection with `AutoAcceptUntrustedCerts`
enabled — accepting any server certificate on an industrial control link — gets no
visible signal anywhere in the logs. The in-scope half of DCL-012's resolution is
therefore not actually effective in production; only the unit test
(`DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse`, which only
checks the default value) passes.

**Recommendation**

Thread a real logger through to `RealOpcUaClient`. `DataConnectionFactory` already
holds an `ILoggerFactory` and constructs `RealOpcUaClientFactory(globalOptions)` —
give `RealOpcUaClientFactory` an `ILoggerFactory` (or an `ILogger<RealOpcUaClient>`)
constructor parameter and pass `_loggerFactory.CreateLogger<RealOpcUaClient>()` into
each `new RealOpcUaClient(...)`. Add a test that asserts the warning is emitted on a
real connect with auto-accept enabled (e.g. via a captured `ILogger`), not just that
the default is `false`.

**Resolution**

Resolved 2026-05-17 (commit pending). `RealOpcUaClientFactory` gained an
`ILoggerFactory` constructor parameter and `Create()` now passes
`_loggerFactory.CreateLogger<RealOpcUaClient>()` into every `RealOpcUaClient` it builds;
`DataConnectionFactory` (which already holds an `ILoggerFactory`) now constructs
`RealOpcUaClientFactory(globalOptions, _loggerFactory)`, so the DCL-012 auto-accept-cert
warning reaches a real logger in production instead of being discarded by `NullLogger`.
The parameterless / single-arg factory constructors still default to `NullLoggerFactory`
so existing callers are unaffected. Regression tests
`DCL014_RealOpcUaClientFactory_CreatesClientWithRealLogger` and
`DCL014_DataConnectionFactory_ThreadsLoggerToRealOpcUaClient` fail against the pre-fix
code (the first fails to compile — no logger ctor; the second observes a `NullLogger`)
and pass after.

### DataConnectionLayer-015 — Initial-connect failures never trigger failover; an unreachable primary at startup never tries the backup

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:404-417`, `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:419-493` |

**Description**

Failover between the primary and backup endpoints is implemented in two places, both
reachable only after a connection has already been `Connected` at least once:
`HandleReconnectResult` (Reconnecting state) counts `_consecutiveFailures` and switches
endpoint, and `BecomeReconnecting` counts `_consecutiveUnstableDisconnects`.

`HandleConnectResult` — the handler for the *initial* connection attempt in the
`Connecting` state (`DataConnectionActor.cs:404-417`) — does neither. On failure it
only logs and re-arms the reconnect timer with `AttemptConnect`; it never increments
`_consecutiveFailures`, never consults `_backupConfig`, and never switches endpoint.

Consequence: if the primary endpoint is unreachable when the connection actor first
starts — which is the common case after a fresh artifact deployment, a site restart,
or a primary that is simply down at that moment — the actor retries the *primary*
endpoint indefinitely at `ReconnectInterval` and **never** attempts the configured
backup. The design doc's endpoint-redundancy promise ("automatic failover when the
active endpoint becomes unreachable") is silently not honoured for the
never-connected-yet case, and an operator sees a connection stuck `Connecting` forever
despite a healthy backup being configured.

**Recommendation**

Make `HandleConnectResult` participate in the failover counter the same way
`HandleReconnectResult` does: increment `_consecutiveFailures` on failure and, when
`_backupConfig != null && _consecutiveFailures >= _failoverRetryCount`, perform the
endpoint switch (dispose adapter, create the other adapter, bump `_adapterGeneration`,
log the failover event) before re-arming the timer. Alternatively, fold the initial
connect into the same reconnect path so there is a single failover decision point. Add
a regression test for "primary down at startup, backup configured → fails over to
backup".

**Resolution**

Resolved 2026-05-17 (commit pending). The endpoint-switch failover logic was extracted
from `HandleReconnectResult` into a shared `CountFailureAndMaybeFailover` helper, and
`HandleConnectResult` (the initial-connect handler in the `Connecting` state) now
increments `_consecutiveFailures` and calls that helper on failure — so a primary that
is unreachable at startup fails over to the configured backup after
`FailoverRetryCount` attempts instead of retrying the primary forever. Single-endpoint
connections (no backup) still retry indefinitely. Regression test
`DCL015_PrimaryDownAtStartup_FailsOverToBackup` fails against the pre-fix code (the
backup adapter is never created) and passes after; `DCL015_SingleEndpointDownAtStartup_RetriesIndefinitely_NoFailover`
guards the no-backup path.

### DataConnectionLayer-016 — `HandleSubscribeCompleted` reports `SubscribeTagsResponse` success even on a connection-level subscribe failure

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:606,666-672`, `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:232-240` |

**Description**

`HandleSubscribeCompleted` computes `connectionLevelFailure` (line 606) and returns it
so the `Connected`-state handler can drive the actor into `Reconnecting`
(`DataConnectionActor.cs:232-240`). But before returning, it unconditionally replies
to the caller with `new SubscribeTagsResponse(..., true, null, ...)` (lines 666-667) —
`Success: true`, `Error: null` — regardless of whether any tag failed at connection
level.

So when a subscribe arrives while the adapter is silently down, the Instance Actor is
told the subscribe **succeeded**, while the connection actor simultaneously transitions
to `Reconnecting`. The tags were never actually subscribed at the adapter (the catch
block recorded `Success: false`); they are recovered later by `ReSubscribeAll` only if
and when reconnection succeeds. The caller has no way to distinguish "subscribed and
healthy" from "accepted, but the connection is currently down" — a misleading
success signal on a request that did not do what the response claims.

(Genuine tag-resolution failures are arguably also reported as overall `true`, but
that is defensible: those tags are tracked in `_unresolvedTags` and the design models
unresolved tags as a runtime quality concern, with a `Bad`-quality `TagValueUpdate`
already pushed. The connection-level case is the clear defect because the actor itself
treats it as a failure worth a state transition.)

**Recommendation**

When `connectionLevelFailure` is true, reply with
`SubscribeTagsResponse(..., success: false, error: "connection unavailable — will
re-subscribe on reconnect", ...)` (or an equivalent), so the caller's response matches
the actor's own assessment. Optionally carry per-tag outcomes in the response so the
Instance Actor can reflect partial success. Add a test asserting the response is not
`Success: true` when a connection-level subscribe failure drives `Reconnecting`.

**Resolution**

Resolved 2026-05-17 (commit pending). `HandleSubscribeCompleted` now replies
`SubscribeTagsResponse(success: false, error: "connection unavailable — will
re-subscribe on reconnect")` when `connectionLevelFailure` is true — matching the
actor's own assessment as it drives into `Reconnecting` — instead of unconditionally
replying `success: true`. Genuine tag-resolution failures still reply `success: true`
(they are a runtime quality concern tracked via `_unresolvedTags`, with a `Bad`-quality
`TagValueUpdate` already pushed), so that case is not regressed. Regression test
`DCL016_ConnectionLevelSubscribeFailure_RepliesWithUnsuccessfulResponse` fails against
the pre-fix code (it returned `Success: true`) and passes after;
`DCL016_GenuineResolutionFailure_StillRepliesSuccess` guards the resolution-failure path.

### DataConnectionLayer-017 — `WriteBatchAsync` aborts the whole batch on a mid-batch disconnect

| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:229-237`, `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:218-227` |

**Description**

`WriteBatchAsync` loops calling `WriteAsync` per tag (`OpcUaDataConnection.cs:229-237`).
`WriteAsync` returns a `WriteResult` for OPC-UA-level write rejections (good — a bad
status does not abort the batch), but it first calls `EnsureConnected()`
(`OpcUaDataConnection.cs:220`), which throws `InvalidOperationException` when the
client is disconnected. `WriteBatchAsync` does not catch that exception, so if the
connection drops partway through a batch the whole `WriteBatchAsync` throws and the
caller gets no result map — losing the per-tag outcomes for the tags that already
wrote. This is the same class of defect that DataConnectionLayer-007 fixed for
`ReadBatchAsync` (which now records a failed `ReadResult` per failing tag and only
propagates `OperationCanceledException`). `WriteBatchAsync` feeds
`WriteBatchAndWaitAsync` (line 246), so a disconnect during a flag-and-wait write
sequence surfaces as an unhandled exception rather than a clean `false`/per-tag result.

Severity is Low because device writes are real-time control operations with no
store-and-forward, the batch write paths are not on the primary `HandleWrite` hot path
(`HandleWrite` calls single-tag `WriteAsync`), and a disconnect mid-batch is itself an
error condition — but the inconsistent error shape (exception vs. per-tag result) is a
maintainability and correctness wart.

**Recommendation**

Mirror the DCL-007 fix: wrap the per-tag `WriteAsync` call in `WriteBatchAsync` in a
try/catch that records a failed `WriteResult(false, ex.Message)` for the failing tag
and continues, while still propagating `OperationCanceledException` to abort a
cancelled batch as a whole. This gives callers (including `WriteBatchAndWaitAsync`) a
complete, consistent result map.

**Resolution**

Resolved 2026-05-17 (commit pending). `WriteBatchAsync` now wraps each per-tag
`WriteAsync` call in a try/catch — mirroring the DCL-007 fix for `ReadBatchAsync` — so a
mid-batch fault (the `InvalidOperationException` from `EnsureConnected` on a dropped
connection) is recorded as a failed `WriteResult(false, ex.Message)` for that tag and
the batch returns a complete result map; `OperationCanceledException` is still
propagated so a cancelled batch aborts as a whole. Callers (including
`WriteBatchAndWaitAsync`) now get a consistent per-tag result shape instead of an
unhandled exception. Regression test
`DCL017_WriteBatch_ReturnsPerTagResults_WhenConnectionDropsMidBatch` fails against the
pre-fix code (the batch throws, no map returned) and passes after;
`DCL017_WriteBatch_CancellationAbortsWholeBatch` guards that cancellation still aborts.

### DataConnectionLayer-018 — Concurrent subscribes for the same tag from different instances orphan an adapter subscription handle

| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:557,564-594,653` |

**Resolution** — added a `_subscribesInFlight` HashSet mirroring the
existing `_resolutionInFlight` pattern. `HandleSubscribe` now partitions
each request's tags on the actor thread into "this request will
SubscribeAsync" vs. "already subscribed by us OR by another in-flight
request"; the second arrival sees the tag in `_subscribesInFlight` and
treats it as `AlreadySubscribed: true` without issuing a duplicate
adapter call. `HandleSubscribeCompleted` removes each
non-AlreadySubscribed result from the set; `ReSubscribeAll` clears it on
reconnect. Regression test
`DCL018_ConcurrentSubscribes_SameTag_DifferentInstances_IssueOneAdapterSubscribe`
parks the first subscribe in flight, fires a second for the same tag,
and asserts exactly one adapter SubscribeAsync call.

**Description**

`HandleSubscribe` snapshots `_subscriptionIds.Keys` into a local `alreadySubscribed`
set on the actor thread before dispatching the `Task.Run` that performs the adapter
I/O (line 557). The snapshot is the only basis on which the background task decides
whether to call `_adapter.SubscribeAsync` — and it is taken **once**, before the I/O
runs.

If two `SubscribeTagsRequest` messages arrive on the actor thread for different
instances that both reference the same tag path, both `HandleSubscribe` invocations
take a snapshot at a time when neither subscribe has completed, so `alreadySubscribed`
does not contain the shared tag in either snapshot. Both background tasks then call
`_adapter.SubscribeAsync(tagPath, ...)`, the adapter creates **two** monitored items
and returns two distinct subscription ids, and each task pipes a `SubscribeCompleted`
back to the actor with `AlreadySubscribed: false, Success: true`.

`HandleSubscribeCompleted` for the first message takes the success branch and writes
`_subscriptionIds[tagPath] = subId1`. The second message arrives, hits the
"already in `_subscriptionIds`" guard at line 653 (`_subscriptionIds.ContainsKey(...)`)
and `continue`s — but `result.SubscriptionId` (the orphan handle for the second
adapter subscription) is silently discarded. The orphan monitored item stays alive in
the OPC UA session for the lifetime of the adapter, sending duplicate data-change
notifications (whose callbacks were stamped with the captured `generation`) into
`HandleTagValueReceived` for every value change. Across a deploy that creates many
instances sharing a few tags, this leaks N-1 monitored items per shared tag and
doubles/triples the per-tag publish traffic.

DCL-010 fixed an analogous duplicate-dispatch bug for the tag-resolution retry path
via `_resolutionInFlight`; the equivalent guard is missing on the user-subscribe
path.

**Recommendation**

Track in-flight subscribes the same way DCL-010 tracks in-flight retries: maintain a
`HashSet<string> _subscribesInFlight` and add `tagPath` to it on the actor thread
**before** the `Task.Run` dispatch, only for tags not already in
`_subscriptionIds` and not already in `_subscribesInFlight`. Tags that are already
in flight should produce a `SubscribeTagResult(..., AlreadySubscribed: true, ...)`
without touching the adapter. Remove from `_subscribesInFlight` in
`HandleSubscribeCompleted` once the result is applied. Add a regression test that
fans two simultaneous `SubscribeTagsRequest` messages for the same tag and asserts
exactly one `_adapter.SubscribeAsync(tag, ...)` call (and no orphan subscription id).

### DataConnectionLayer-019 — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from concurrent thread-pool continuations

| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:31,167,177`, `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:163-164` |

**Resolution** — deleted the dead `_subscriptionHandles` field outright.
Subscription bookkeeping lives in `RealOpcUaClient._monitoredItems` /
`_callbacks` (already `ConcurrentDictionary` per DCL-003) at the device
layer and `DataConnectionActor._subscriptionIds` at the actor layer;
the adapter had no live reader and the field was a latent
race-condition trap. Added structural regression test
`DCL019_OpcUaDataConnection_HasNoNonConcurrentSharedDictionary` that
reflects over the adapter's fields and fails if any plain
`Dictionary<,>` is reintroduced.

**Description**

`OpcUaDataConnection._subscriptionHandles` is declared as `Dictionary<string,
string>`. It is mutated from:

- `SubscribeAsync` (line 167): `_subscriptionHandles[subscriptionId] = tagPath;`
  after an `await _client!.CreateSubscriptionAsync(...)` — i.e. the assignment
  executes on the continuation thread (a thread-pool thread).
- `UnsubscribeAsync` (line 177): `_subscriptionHandles.Remove(subscriptionId);`
  similarly after an `await`.
- `DisconnectAsync` indirectly via the underlying `_client.DisconnectAsync` does
  **not** touch `_subscriptionHandles`, but multiple `SubscribeAsync` /
  `UnsubscribeAsync` calls can run in parallel from the upper layer.

The DCL upper layer calls `_adapter.SubscribeAsync` from multiple places that all
run off the actor thread:

- `DataConnectionActor.HandleSubscribe` inside its `Task.Run` (multiple invocations
  can run in parallel — see DCL-018);
- `HandleRetryTagResolution` issues `_adapter.SubscribeAsync` for every tag in
  `_unresolvedTags` and pipes the continuation (each subscribe runs concurrently
  via the SDK's async machinery);
- `ReSubscribeAll` does the same after a reconnect.

So plain-`Dictionary` mutations occur on multiple thread-pool threads concurrently —
the exact pattern DCL-003 fixed by switching `RealOpcUaClient._monitoredItems` and
`_callbacks` to `ConcurrentDictionary<,>`. Plain `Dictionary` mutations during a
concurrent resize are undefined behaviour: they can throw
`InvalidOperationException`, corrupt the internal hash buckets, or lose entries.

This is `_subscriptionHandles` is currently dead state (the dictionary is written to
and `Remove`d but **never read**), so a corruption today would not crash the
subscribe path — but the bug is latent and the field will become load-bearing the
moment any code reads it (e.g., to expose a subscription-id-to-tag-path lookup for
diagnostics, which is what the dictionary's name suggests it was intended for).

**Recommendation**

Either (a) change `_subscriptionHandles` to
`ConcurrentDictionary<string, string>` and use `TryAdd` / `TryRemove`, mirroring
DCL-003's fix one layer down, or (b) delete the field entirely since it is never
read — the bookkeeping is fully owned by `RealOpcUaClient._monitoredItems` /
`_callbacks` and `DataConnectionActor._subscriptionIds`. Removing it eliminates the
race and removes dead state in one stroke. Add a regression test (or extend
`DCL003_SharedDictionaryFields_AreConcurrentCollections`) that asserts no
non-concurrent `Dictionary` field is shared across thread boundaries in adapter
state.

### DataConnectionLayer-020 — `HandleSubscribeCompleted` double-counts `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance's subscribe

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:653-661,670-688` |

**Resolution** — split the success branch into "fresh subscribe" vs
"unresolved → resolved promotion": `_unresolvedTags.Remove(...)` is now
called before incrementing the counters; a promotion increments only
`_resolvedTags` and clears `_resolutionInFlight`, mirroring
`HandleTagResolutionSucceeded`. The symmetric failure branch was also
fixed — `_totalSubscribed` only increments when
`_unresolvedTags.Add(...)` returns `true` so a second instance failing
to resolve the same tag is a no-op on the counter. Tests:
`DCL020_UnresolvedTagPromoted_ByDifferentInstance_DoesNotDoubleCountTotalSubscribed`
and `DCL020_TwoInstancesFailingSameTag_OnlyCountsTagOnceInTotal`.

**Description**

`HandleSubscribeCompleted`'s success branch (line 656-661) writes
`_subscriptionIds[result.TagPath] = result.SubscriptionId!; _totalSubscribed++;
_resolvedTags++;`. The guard at line 653 only skips when the tag is already in
`_subscriptionIds`; it does **not** check `_unresolvedTags`. So the success branch
runs for a tag that previously failed resolution from an earlier instance's subscribe
(which incremented `_totalSubscribed` and added the tag to `_unresolvedTags` at line
674-676) and is now successfully subscribed by a later instance.

Sequence:

1. Instance A subscribes "Tag1". `_adapter.SubscribeAsync` throws a non-connection-level
   exception. `HandleSubscribeCompleted` takes the resolution-failure branch:
   `_unresolvedTags.Add("Tag1"); _totalSubscribed++;` (now 1).
2. The device finishes booting. Instance B subscribes "Tag1". `_adapter.SubscribeAsync`
   succeeds, returning `subId`. `HandleSubscribeCompleted` takes the success branch:
   `_subscriptionIds["Tag1"] = subId; _totalSubscribed++; _resolvedTags++;`
   (now `_totalSubscribed = 2`, `_resolvedTags = 1`).
3. `_unresolvedTags` still contains "Tag1". The retry timer fires next tick,
   `HandleRetryTagResolution` dispatches `SubscribeAsync("Tag1", ...)` against the
   adapter (creating a **second** monitored item for the same tag), and
   `HandleTagResolutionSucceeded` runs `_unresolvedTags.Remove("Tag1")` →
   `_subscriptionIds["Tag1"] = newSubId` (overwriting Instance B's id, orphaning that
   monitored item) → `_resolvedTags++` (now 2, matching `_totalSubscribed`).

Net effect:

- `_totalSubscribed` is over-counted by 1 from step 2 until step 3 reconciles
  `_resolvedTags`. During that window the health report's "subscribed / resolved"
  ratio is wrong.
- Two adapter subscription handles for the same tag are leaked across this race
  (DCL-018's orphan plus the retry's second adapter call); the second leaks
  permanently because `_subscriptionIds["Tag1"]` only stores the most recent id.

**Recommendation**

In `HandleSubscribeCompleted`'s success branch, before the `_totalSubscribed++`,
check `_unresolvedTags.Remove(result.TagPath)` — if the tag was already counted as
unresolved, promote it without re-incrementing `_totalSubscribed` (mirror
`HandleTagResolutionSucceeded`'s shape: only increment `_resolvedTags`,
`_subscriptionIds[tag] = subId`, and clear `_resolutionInFlight`). Add a regression
test that asserts `_totalSubscribed` / `_resolvedTags` consistency after the
"resolve via a second instance" sequence.

### DataConnectionLayer-021 — `HandleSubscribeCompleted` re-creates and leaks `_subscriptionsByInstance` entry when the instance unsubscribed mid-flight

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:626-634,642-687` |

**Resolution** — `HandleSubscribeCompleted` now detects the
mid-termination race: when `_subscriptionsByInstance.TryGetValue` fails,
the method logs a warning, clears each owned tag from
`_subscribesInFlight`, fires fire-and-forget
`_adapter.UnsubscribeAsync(result.SubscriptionId!)` for every successful
non-AlreadySubscribed result (so the OPC UA monitored items don't keep
streaming for a tag nobody listens to), and returns without applying
counter or handle mutations. Regression test
`DCL021_UnsubscribeDuringInFlightSubscribe_ReleasesAdapterHandle_AndKeepsStateClean`
parks the subscribe, sends `UnsubscribeTagsRequest`, then releases the
subscribe and asserts `_adapter.UnsubscribeAsync` is called and
`TotalSubscribedTags == 0`.

**Description**

`HandleSubscribe` dispatches a `Task.Run` that performs adapter I/O off the actor
thread and pipes a `SubscribeCompleted` back. If an `UnsubscribeTagsRequest` for the
same instance is processed on the actor thread between dispatch and completion,
`HandleUnsubscribe` removes the instance from both `_subscriptionsByInstance` and
`_subscribers`. When the late `SubscribeCompleted` arrives,
`HandleSubscribeCompleted` (line 629-634) **re-creates** the
`_subscriptionsByInstance[instanceName] = new HashSet<string>()` entry and proceeds
to apply the results — but `_subscribers[instanceName]` was already removed by the
unsubscribe and is **not** re-added.

Consequences:

1. `_subscriptionsByInstance` keeps a permanently-leaked entry for an instance that
   no longer exists. `ReSubscribeAll` derives its tag list from
   `_subscriptionsByInstance.Values` and will keep re-subscribing the leaked tags on
   every future reconnect.
2. For each tag, `_tagSubscriberCount[tagPath]` is incremented (line 647-649), so the
   reverse index treats the leaked instance as a real subscriber. The only way to
   drop the count is another `HandleUnsubscribe` for the same instance — which can
   never arrive because the Instance Actor that owned the instance is gone.
3. The success branch increments `_totalSubscribed` / `_resolvedTags` (or
   `_unresolvedTags` for genuine resolution failures), drifting health counters
   permanently above the actual subscribed instance count.
4. Subsequent `HandleTagValueReceived` fanout iterates `_subscriptionsByInstance` and
   skips this entry via the `_subscribers.TryGetValue` check (line 1019), so values
   are silently dropped — but the work of fanning them out (the iteration and the
   tag lookup) is still done for every value update on every leaked tag, forever.
5. The genuine-resolution-failure path at line 682-686 (`subscriber.Tell(new
   TagValueUpdate(..., QualityCode.Bad, ...))`) also silently no-ops because
   `_subscribers.TryGetValue` is false — so the design doc's "push bad quality on
   resolution failure" promise is broken for this case (a minor, edge-case wrinkle).

**Recommendation**

In `HandleSubscribeCompleted`, when `_subscriptionsByInstance.TryGetValue` fails,
treat the result as obsolete: log it and `return` without re-creating the entry or
applying any state mutations. Any successfully-created adapter subscriptions in
`msg.Results` should be cleaned up — iterate the results and
`_adapter.UnsubscribeAsync(result.SubscriptionId!)` (fire-and-forget) for each
successful one so the orphan handles do not leak in the adapter. Add a regression
test that subscribes from instance A, immediately sends an `UnsubscribeTagsRequest`
for A while the subscribe I/O is in flight, completes the subscribe, and asserts
`_subscriptionsByInstance`, `_tagSubscriberCount` and health counters are all clean.

### DataConnectionLayer-022 — `HandleSubscribeCompleted` and `HandleTagResolutionFailed` reset the tag-resolution retry timer on every call via `StartPeriodicTimer`, starving the retry under subscribe bursts

| | |
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:691-698,991-998` |

**Resolution** — both call sites now gate
`Timers.StartPeriodicTimer("tag-resolution-retry", ...)` with
`!Timers.IsTimerActive("tag-resolution-retry")` so the first failure
arms the timer and subsequent failures only pile onto `_unresolvedTags`.
Regression test
`DCL022_BurstedFailedSubscribes_DoNotResetRetryTimer` fires 5 failed
subscribes within one retry interval and asserts the retry timer
actually fires within one interval of the first failure (pre-fix it
would have been pushed past the interval boundary by the resets).

**Description**

`HandleSubscribeCompleted` (line 691-698) and `HandleTagResolutionFailed` (line
991-998) both call:

```
Timers.StartPeriodicTimer(
    "tag-resolution-retry",
    new RetryTagResolution(),
    _options.TagResolutionRetryInterval,
    _options.TagResolutionRetryInterval);
```

`Akka.Actor.ITimerScheduler.StartPeriodicTimer(key, ...)` cancels and replaces any
existing timer registered under the same key. So every additional subscribe (or
every additional tag-resolution failure) that produces unresolved tags **resets** the
retry timer's countdown to the full interval — the timer never accumulates
elapsed time across calls.

With the default `TagResolutionRetryInterval = 10s`, an instance-startup burst that
produces a new `SubscribeTagsRequest` every 5s (a not-unusual cadence during
deployment fan-out) will keep cancelling the not-yet-fired retry every 5s, so the
"periodic" retry never actually fires until subscribes go quiet. In a steady-state
site with many instances deploying together this can delay tag resolution by tens
of seconds, leaving attributes at `Bad` quality longer than the documented retry
interval implies.

**Recommendation**

Start the periodic timer once, when the actor first transitions to having
non-empty `_unresolvedTags`, and only re-start it after `Timers.Cancel(...)` has
been called (e.g., when the actor enters `Reconnecting`). The cleanest pattern is to
gate the start with `if (!Timers.IsTimerActive("tag-resolution-retry"))` before
calling `StartPeriodicTimer` — `IsTimerActive` is on `ITimerScheduler`. Apply the
same gate at both call sites. Add a regression test that fires 5 subscribes with
unresolved tags within one retry interval and asserts the retry fires at most one
interval after the first failure (not after the fifth subscribe).

### DataConnectionLayer-023 — Native-alarm mid-flight unsubscribe leaks the adapter alarm feed

| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:1773,1790-1809,1864-1882` |

**Description**

The native-alarm subscribe path mirrors the tag-subscribe path's
dispatch-I/O-then-`PipeTo(Self)` shape but never inherited the obsolete-completion
guard the tag path earned in DCL-021. `HandleSubscribeAlarms` adds the source to
`_alarmSubscribesInFlight` (line 1773) and dispatches
`alarmable.SubscribeAlarmsAsync(...)` whose `ContinueWith` pipes an
`AlarmSubscribeCompleted` back to `Self`. If an `UnsubscribeAlarmsRequest` for the
last (or only) subscriber is processed on the actor thread between that dispatch and
the completion, `HandleUnsubscribeAlarms` (lines 1864-1882) removes the subscriber,
empties `_alarmSourceSubscribers`, and removes the filter entries — but it tries to
tear down the adapter feed via `_alarmSubscriptionIds.Remove(...)`, which **fails**
because the subscription id has not been stored yet (it is still in flight). It also
leaves the stale `_alarmSubscribesInFlight` entry in place. The late
`AlarmSubscribeCompleted` then reaches `HandleAlarmSubscribeCompleted` (lines
1790-1809), which **unconditionally** stores `_alarmSubscriptionIds[msg.SourceReference]
= msg.SubscriptionId` (line 1796) without re-checking whether any subscriber still
exists in `_alarmSourceSubscribers`.

The net result is an orphaned adapter alarm feed: the OPC UA monitored-item set / the
MxGateway gateway-wide alarm stream stays alive for a source with **zero** subscribers,
streaming transitions into `HandleAlarmTransitionReceived` that match no subscriber set
and are silently dropped — the feed is never torn down for the lifetime of the adapter
(only a full `ReSubscribeAllAlarms` on reconnect clears `_alarmSubscriptionIds`, and
even that re-subscribes from `_alarmSourceSubscribers`, which no longer contains the
orphaned source, so the stale device-side feed is simply abandoned, not closed). The
tag path closes exactly this race in `HandleSubscribeCompleted` (lines 834-867) by
releasing the just-created adapter handle and clearing `_subscribesInFlight` when the
instance entry has gone; the alarm path does neither. This matters because native-alarm
sources are created/destroyed on every deploy/undeploy and instance stop, so each
mid-flight unsubscribe permanently leaks one device-side alarm subscription and its
publish traffic.

**Recommendation**

Mirror the closed DCL-021 fix on the alarm path. In `HandleAlarmSubscribeCompleted`,
before storing the id, guard on the live subscriber set: `if
(!_alarmSourceSubscribers.ContainsKey(msg.SourceReference)) {
if (msg.Success && msg.SubscriptionId != null && _adapter is
IAlarmSubscribableConnection alarmable) _ =
alarmable.UnsubscribeAlarmsAsync(msg.SubscriptionId); return; }` so a feed that
completed after its last subscriber left is released at the adapter rather than stored.
Additionally clear the `_alarmSubscribesInFlight` entry in `HandleUnsubscribeAlarms`
when the last subscriber leaves, so the in-flight marker does not linger. Add a
regression test that subscribes a source, sends `UnsubscribeAlarmsRequest` while the
alarm subscribe I/O is in flight, completes the subscribe, and asserts
`UnsubscribeAlarmsAsync` is called and `_alarmSubscriptionIds` /
`_alarmSubscribesInFlight` are clean.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): `HandleAlarmSubscribeCompleted` now guards against an orphaned completion (if no live subscriber remains for the source, it tears down the just-created adapter subscription and returns) and `HandleUnsubscribeAlarms` clears the in-flight marker when the last subscriber leaves — mirroring the DCL-021 tag-path fix. Regression test added (verified failing pre-fix).

### DataConnectionLayer-024 — No alarm-resolution retry: a failed initial alarm subscribe strands the subscriber until the next full reconnect

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:1752-1757,1790-1809,1885-1908` |

**Description**

When an initial native-alarm subscribe fails, the subscriber is left registered but
with no feed and no path to recovery short of a full connection reconnect.
`HandleSubscribeAlarms` registers the subscriber in `_alarmSourceSubscribers` (lines
1752-1757) **before** issuing the adapter subscribe, so a transition arriving
mid-subscribe is still routed. If `alarmable.SubscribeAlarmsAsync` fails, the piped
`AlarmSubscribeCompleted` reaches `HandleAlarmSubscribeCompleted` (lines 1790-1809),
which on the failure branch only logs a warning (line 1801) and replies a
`SubscribeAlarmsResponse(success: false, ...)`. It does **not** re-arm any retry, does
**not** push `NativeAlarmSourceUnavailable` to the subscriber, and does **not** drop
the source from `_alarmSourceSubscribers`. The subscriber therefore stays in the
routing map with the adapter feed never established.

Recovery only happens via `ReSubscribeAllAlarms` (lines 1885-1908), which is invoked
solely from `BecomeConnected` after a full connection reconnect (line 583). So a
transient device-side failure on the *initial* alarm subscribe — the alarm server
still booting, a momentary fault — leaves the source dark until the entire connection
happens to cycle through `Reconnecting`. This diverges from the tag path, which arms a
periodic `tag-resolution-retry` timer (lines 989-993, 1678-1682; fired by
`HandleRetryTagResolution` at line 1491) so a tag that fails to resolve at subscribe
time is retried every `TagResolutionRetryInterval` (10 s default) without needing a
reconnect. The native-alarm source has no analogous self-healing.

This is partly a **design judgment call** rather than a clear-cut defect: the M7 design
models native alarms as a read-only mirror, and "drop the source and signal the
subscriber" is an equally valid contract to "retry the alarm subscribe periodically."
The defect is that the code currently does **neither** — it silently leaves a
registered subscriber with a permanently-dark feed and no signal — which is the worst
of both.

**Recommendation**

Pick one of the two valid contracts and implement it (this is a judgment call — do not
assume which is wanted):

- **Retry:** add an `alarm-resolution-retry` periodic timer (mirroring
  `tag-resolution-retry`): track sources whose subscribe failed, re-issue
  `SubscribeAlarmsAsync` on the timer, and gate the timer start with
  `IsTimerActive` (per the DCL-022 pattern) so a burst of failed subscribes does not
  reset it.
- **Fail fast:** on the failure branch of `HandleAlarmSubscribeCompleted`, push
  `NativeAlarmSourceUnavailable` to the subscriber(s) for that source and remove the
  source from `_alarmSourceSubscribers` / `_alarmSubscribesInFlight`, so the
  `NativeAlarmActor` learns the feed is unavailable rather than waiting forever.

Either way, add a regression test for "initial alarm subscribe fails → subscriber is
not silently left dark."

**Resolution**

Deferred 2026-06-20: the fix (add an alarm-resolution retry timer vs. push `NativeAlarmSourceUnavailable` and drop the source on initial-subscribe failure) is a design-owner decision; the current code does neither. Awaiting that decision before implementing.

### DataConnectionLayer-025 — `MxGatewayDataConnection.DisposeAsync` abandons `_alarmCts`, leaking the alarm task on failover/stop

| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Adapters/MxGatewayDataConnection.cs:277-283,121-134` |

**Description**

`MxGatewayDataConnection.DisposeAsync` (lines 277-283) cancels `_eventLoopCts` and
disposes the underlying `_client`, but never touches `_alarmCts`. The alarm feed is
established in `SubscribeAlarmsAsync` (lines 153-175): under `_alarmLock` it lazily
creates `_alarmCts = new CancellationTokenSource()` and launches a long-running
`Task.Run(() => client.RunAlarmStreamAsync(null, ..., token))`. That background alarm
stream is bound only to `_alarmCts.Token`. `DisconnectAsync` (lines 121-134) already
tears it down correctly — under `_alarmLock` it cancels and disposes `_alarmCts`,
nulls it, and resets `_alarmSubCount` — but `DisposeAsync` does not, so the
`CancellationTokenSource` and the alarm-stream `Task` are both leaked whenever the
adapter is disposed without a prior `DisconnectAsync`.

The `DataConnectionActor` disposes adapters fire-and-forget on failover (and on
actor/connection stop) via `_adapter.DisposeAsync()` without necessarily calling
`DisconnectAsync` first, so every MxGateway failover or connection teardown that goes
through `DisposeAsync` leaks one running alarm-stream task plus its CTS. Severity is
Low because the leaked task is cancellation-bound to a CTS that will be GC-reclaimed
eventually and the gRPC stream it holds will fault when `_client` is disposed, but a
still-running alarm-stream loop racing a disposed client is precisely the class of
dangling background work the lock-guarded `DisconnectAsync` block was written to avoid.

**Recommendation**

In `DisposeAsync`, cancel and dispose `_alarmCts` under `_alarmLock` before disposing
the client — copy the block already present in `DisconnectAsync` (lines 124-130):
`lock (_alarmLock) { _alarmCts?.Cancel(); _alarmCts?.Dispose(); _alarmCts = null;
_alarmSubCount = 0; }`. This guarantees the alarm-stream task is cancelled
deterministically on every teardown path, not just the `DisconnectAsync` one.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): `MxGatewayDataConnection.DisposeAsync` now cancels and disposes `_alarmCts` under `_alarmLock`, matching `DisconnectAsync` — no more alarm task/CTS leak on failover/stop.

### DataConnectionLayer-028 — `_alarmSourceFilter` "first subscriber wins" comment contradicts the last-wins code

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.DataConnectionLayer/Actors/DataConnectionActor.cs:102-103,1758-1761` |

**Description**

The XML doc on the `_alarmSourceFilter` field (lines 102-103) reads "sourceReference →
raw condition filter string passed to the adapter (**first subscriber wins**)." The
code does the opposite: `HandleSubscribeAlarms` **unconditionally** overwrites the
filter on every subscribe — `_alarmSourceFilter[request.SourceReference] =
request.ConditionFilter;` (line 1758) and likewise re-parses
`_alarmSourceFilterPredicate[request.SourceReference] =
AlarmConditionFilter.Parse(request.ConditionFilter)` (line 1761) — with no
"already present" guard. So a **second** subscriber to the same source reference
re-filters the shared feed with its own condition filter, i.e. **last subscriber
wins**, not first. The comment will mislead a maintainer reasoning about which filter
governs a shared alarm feed when two instances subscribe to the same source with
different condition filters, and it understates a real behavioural subtlety (the second
subscriber silently changes the gate applied to the first subscriber's transitions in
`HandleAlarmTransitionReceived`).

**Recommendation**

Correct the comment to "last subscriber wins" (and note that the shared feed carries a
single filter, so co-subscribers to one source reference must agree on the condition
filter). If per-subscriber filtering is actually intended, make the predicate
per-subscriber rather than per-source so each subscriber's own filter governs only its
own deliveries — but that is a behaviour change, not a comment fix, and should be
decided explicitly.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): the `_alarmSourceFilter` comment corrected from 'first subscriber wins' to 'last subscriber wins' (and notes co-subscribers share the single filter).