code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including first-time reviews of the four newer components (AuditLog, NotificationOutbox, SiteCallAudit, Transport) — so the code-reviews/ index reflects today's codebase rather than the 2026-05-16 baseline. 172 new Open findings (0 Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules. regen-readme.py now derives each module's Last reviewed + Commit from its findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future single-module re-reviews show their own date in the Module Status table.
2026-05-28 02:55:47 -04:00
parent 1eb6e972b0
commit f93b7b99bb
25 changed files with 8793 additions and 115 deletions
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.DataConnectionLayer` |
 | Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 5 |

 ## Summary

@@ -30,6 +30,40 @@ the design doc's failover state machine and the implemented unstable-disconnect
 heuristic. Test coverage is adequate for the happy paths and failover but absent for
 tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+The 2026-05-28 re-review walked all 10 checklist categories against the current
+source and found **5 new findings**. All 17 prior findings remain `Resolved` and the
+fixes (reverse-index unsubscribe, atomic disconnect guards, real-logger threading,
+initial-connect failover, per-tag write-batch results, subscribe-response accuracy)
+were verified in place. The new findings cluster around `HandleSubscribe` /
+`HandleSubscribeCompleted` race-induced state drift:
+
+- **High** — concurrent subscribes for the same tag from different instances each see
+  the tag as not-yet-subscribed (the `alreadySubscribed` snapshot was taken before
+  the Task.Run dispatch), so each Task.Run calls `_adapter.SubscribeAsync` and the
+  later `HandleSubscribeCompleted` silently discards the second adapter subscription
+  handle — the orphan never gets `UnsubscribeAsync`'d.
+- **Medium** — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>`
+  mutated from thread-pool continuations of `SubscribeAsync` / `UnsubscribeAsync`
+  calls running in parallel. This is the same class of bug DCL-003 fixed in
+  `RealOpcUaClient` but missed in the layer above.
+- **Medium** — `HandleSubscribeCompleted`'s success branch never checks
+  `_unresolvedTags`, so a tag that previously failed resolution (incrementing
+  `_totalSubscribed`) and is then successfully subscribed by a different instance gets
+  `_totalSubscribed++` a second time, double-counting; meanwhile the unresolved entry
+  lingers until the retry timer also resolves it, creating an orphaned monitored item.
+- **Medium** — when an instance is unsubscribed mid-flight,
+  `HandleSubscribeCompleted` re-creates an empty `_subscriptionsByInstance[name]`
+  entry and processes the late results, leaking `_tagSubscriberCount` /
+  `_totalSubscribed` / `_resolvedTags` increments for an instance with no
+  `_subscribers` entry to deliver values to.
+- **Medium** — `HandleSubscribeCompleted` calls `Timers.StartPeriodicTimer` on every
+  completed subscribe with unresolved tags; in Akka.NET, `StartPeriodicTimer` with the
+  same key cancels and replaces the existing timer, so a burst of subscribes arriving
+  faster than `TagResolutionRetryInterval` (10 s default) keeps resetting the timer
+  and the retry never actually fires.
+
 #### Re-review 2026-05-17 (commit `39d737e`)

 All 13 findings from the 2026-05-16 review remain `Resolved` and the fixes were
@@ -50,7 +84,22 @@ so a mid-batch disconnect aborts the whole write batch (the same class of defect
 DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
 `DataConnectionLayer-014`.

-## Checklist coverage
+## Checklist coverage (2026-05-28 re-review, commit `1eb6e97`)
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | x | Findings 020 (double-count `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance) and 021 (leaked `_subscriptionsByInstance` entry + counter increments when instance unsubscribes mid-flight). |
+| 2 | Akka.NET conventions | x | Finding 022 — `Timers.StartPeriodicTimer` reset on every `HandleSubscribeCompleted` for unresolved tags can stall the retry timer indefinitely under a subscribe burst. |
+| 3 | Concurrency & thread safety | x | Finding 018 — concurrent subscribes for the same tag from different instances each spawn an adapter subscription and the second handle is orphaned. Finding 019 — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from thread-pool continuations (same class of bug as DCL-003 one layer above). |
+| 4 | Error handling & resilience | x | No new issues; DCL-004 / DCL-007 / DCL-015 / DCL-017 fixes verified in place. |
+| 5 | Security | x | No new issues; DCL-012 / DCL-014 fixes verified. The Commons-side `OpcUaEndpointConfig.AutoAcceptUntrustedCerts = true` default surfaced in DCL-012 is still present but is out of this module's scope. |
+| 6 | Performance & resource management | x | No new issues; DCL-008 reverse index verified. (Finding 018's orphaned adapter handle is logged under concurrency.) |
+| 7 | Design-document adherence | x | No new issues. DCL-009's design-doc action (document unstable-disconnect failover trigger + configurable threshold) is still open at the doc level but out of this module's scope. |
+| 8 | Code organization & conventions | x | No issues — POCOs in Commons, options class owned by component, factory + DI registration consistent. |
+| 9 | Testing coverage | x | DCL001–017 regression tests present. Gaps remain for finding 018 (concurrent subscribe of same tag from two instances), 019 (concurrent `_subscriptionHandles` mutation), 020 (resolve-via-different-instance), 021 (unsubscribe-mid-flight), 022 (timer-reset starvation). |
+| 10 | Documentation & comments | x | No new issues; DCL-013 atomic-guard XML comments verified. |
+
+## Checklist coverage (2026-05-17 re-review, commit `39d737e`)

 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
@@ -896,3 +945,268 @@ unhandled exception. Regression test
 `DCL017_WriteBatch_ReturnsPerTagResults_WhenConnectionDropsMidBatch` fails against the
 pre-fix code (the batch throws, no map returned) and passes after;
 `DCL017_WriteBatch_CancellationAbortsWholeBatch` guards that cancellation still aborts.
+
+### DataConnectionLayer-018 — Concurrent subscribes for the same tag from different instances orphan an adapter subscription handle
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:557,564-594,653` |
+
+**Description**
+
+`HandleSubscribe` snapshots `_subscriptionIds.Keys` into a local `alreadySubscribed`
+set on the actor thread before dispatching the `Task.Run` that performs the adapter
+I/O (line 557). The snapshot is the only basis on which the background task decides
+whether to call `_adapter.SubscribeAsync` — and it is taken **once**, before the I/O
+runs.
+
+If two `SubscribeTagsRequest` messages arrive on the actor thread for different
+instances that both reference the same tag path, both `HandleSubscribe` invocations
+take a snapshot at a time when neither subscribe has completed, so `alreadySubscribed`
+does not contain the shared tag in either snapshot. Both background tasks then call
+`_adapter.SubscribeAsync(tagPath, ...)`, the adapter creates **two** monitored items
+and returns two distinct subscription ids, and each task pipes a `SubscribeCompleted`
+back to the actor with `AlreadySubscribed: false, Success: true`.
+
+`HandleSubscribeCompleted` for the first message takes the success branch and writes
+`_subscriptionIds[tagPath] = subId1`. The second message arrives, hits the
+"already in `_subscriptionIds`" guard at line 653 (`_subscriptionIds.ContainsKey(...)`)
+and `continue`s — but `result.SubscriptionId` (the orphan handle for the second
+adapter subscription) is silently discarded. The orphan monitored item stays alive in
+the OPC UA session for the lifetime of the adapter, sending duplicate data-change
+notifications (whose callbacks were stamped with the captured `generation`) into
+`HandleTagValueReceived` for every value change. In a deploy where N instances
+subscribe to the same tag concurrently, this can leak up to N-1 monitored items for
+that tag and multiplies its publish traffic by the number of live monitored items.
+
+DCL-010 fixed an analogous duplicate-dispatch bug for the tag-resolution retry path
+via `_resolutionInFlight`; the equivalent guard is missing on the user-subscribe
+path.
+
+**Recommendation**
+
+Track in-flight subscribes the same way DCL-010 tracks in-flight retries: maintain a
+`HashSet<string> _subscribesInFlight` and add `tagPath` to it on the actor thread
+**before** the `Task.Run` dispatch, only for tags not already in
+`_subscriptionIds` and not already in `_subscribesInFlight`. Tags that are already
+in flight should produce a `SubscribeTagResult(..., AlreadySubscribed: true, ...)`
+without touching the adapter. Remove from `_subscribesInFlight` in
+`HandleSubscribeCompleted` once the result is applied. Add a regression test that
+fans two simultaneous `SubscribeTagsRequest` messages for the same tag and asserts
+exactly one `_adapter.SubscribeAsync(tag, ...)` call (and no orphan subscription id).
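+
+A minimal sketch of the in-flight guard recommended above, assuming it runs only on
+the actor thread. `SubscribeGuard`, `TryBeginSubscribe` and `Complete` are
+hypothetical names used for illustration; only `_subscriptionIds` and the proposed
+`_subscribesInFlight` come from this finding.
+
+```csharp
+#nullable enable
+using System;
+using System.Collections.Generic;
+
+// Two requests for the same tag arrive before either adapter call completes.
+var guard = new SubscribeGuard();
+Console.WriteLine(guard.TryBeginSubscribe("Tag1")); // True: dispatch the adapter call
+Console.WriteLine(guard.TryBeginSubscribe("Tag1")); // False: report AlreadySubscribed
+
+guard.Complete("Tag1", "sub-1");
+Console.WriteLine(guard.TryBeginSubscribe("Tag1")); // False: tag is now subscribed
+
+// Hypothetical stand-in for the actor's subscribe bookkeeping.
+sealed class SubscribeGuard
+{
+    private readonly Dictionary<string, string> _subscriptionIds = new();
+    private readonly HashSet<string> _subscribesInFlight = new();
+
+    // Call on the actor thread, before the Task.Run dispatch.
+    public bool TryBeginSubscribe(string tagPath)
+    {
+        if (_subscriptionIds.ContainsKey(tagPath)) return false;
+        return _subscribesInFlight.Add(tagPath); // false when already in flight
+    }
+
+    // Call on the actor thread once the completion message is applied.
+    public void Complete(string tagPath, string? subscriptionId)
+    {
+        _subscribesInFlight.Remove(tagPath);
+        if (subscriptionId is not null) _subscriptionIds[tagPath] = subscriptionId;
+    }
+}
+```
+
+Because both sets are touched only on the actor thread, plain collections are
+sufficient here; no locking is implied by this sketch.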
+
+### DataConnectionLayer-019 — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from concurrent thread-pool continuations
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:31,163-164,167,177` |
+
+**Description**
+
+`OpcUaDataConnection._subscriptionHandles` is declared as `Dictionary<string,
+string>`. It is mutated from:
+
+- `SubscribeAsync` (line 167): `_subscriptionHandles[subscriptionId] = tagPath;`
+  after an `await _client!.CreateSubscriptionAsync(...)` — i.e. the assignment
+  executes on the continuation thread (a thread-pool thread).
+- `UnsubscribeAsync` (line 177): `_subscriptionHandles.Remove(subscriptionId);`
+  similarly after an `await`.
+- `DisconnectAsync`, which goes through the underlying `_client.DisconnectAsync`,
+  does **not** touch `_subscriptionHandles` itself, but multiple `SubscribeAsync` /
+  `UnsubscribeAsync` calls can run in parallel from the upper layer.
+
+The DCL upper layer calls `_adapter.SubscribeAsync` from multiple places that all
+run off the actor thread:
+
+- `DataConnectionActor.HandleSubscribe` inside its `Task.Run` (multiple invocations
+  can run in parallel — see DCL-018);
+- `HandleRetryTagResolution` issues `_adapter.SubscribeAsync` for every tag in
+  `_unresolvedTags` and pipes the continuation (each subscribe runs concurrently
+  via the SDK's async machinery);
+- `ReSubscribeAll` does the same after a reconnect.
+
+So plain-`Dictionary` mutations occur on multiple thread-pool threads concurrently —
+the exact pattern DCL-003 fixed by switching `RealOpcUaClient._monitoredItems` and
+`_callbacks` to `ConcurrentDictionary<,>`. Plain `Dictionary` mutations during a
+concurrent resize are undefined behaviour: they can throw
+`InvalidOperationException`, corrupt the internal hash buckets, or lose entries.
+
+`_subscriptionHandles` is currently dead state (the dictionary is written to
+and `Remove`d but **never read**), so a corruption today would not crash the
+subscribe path — but the bug is latent and the field will become load-bearing the
+moment any code reads it (e.g., to expose a subscription-id-to-tag-path lookup for
+diagnostics, which is what the dictionary's name suggests it was intended for).
+
+**Recommendation**
+
+Either (a) change `_subscriptionHandles` to
+`ConcurrentDictionary<string, string>` and use `TryAdd` / `TryRemove`, mirroring
+DCL-003's fix one layer down, or (b) delete the field entirely since it is never
+read — the bookkeeping is fully owned by `RealOpcUaClient._monitoredItems` /
+`_callbacks` and `DataConnectionActor._subscriptionIds`. Removing it eliminates the
+race and removes dead state in one stroke. Add a regression test (or extend
+`DCL003_SharedDictionaryFields_AreConcurrentCollections`) that asserts no
+non-concurrent `Dictionary` field is shared across thread boundaries in adapter
+state.
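+
+Option (a) can be sketched as follows. This is an illustration only: the local
+`handles` variable stands in for `_subscriptionHandles`, and `Parallel.For` is used
+here merely to simulate continuations landing on several thread-pool threads.
+
+```csharp
+using System;
+using System.Collections.Concurrent;
+using System.Threading.Tasks;
+
+// Stand-in for the adapter field: key = subscription id, value = tag path.
+var handles = new ConcurrentDictionary<string, string>();
+
+// Simulate many subscribe continuations completing concurrently.
+Parallel.For(0, 1000, i => { handles.TryAdd($"sub-{i}", $"Tag{i}"); });
+Console.WriteLine(handles.Count); // 1000
+
+// Simulate concurrent unsubscribes.
+Parallel.For(0, 1000, i => { handles.TryRemove($"sub-{i}", out _); });
+Console.WriteLine(handles.Count); // 0
+```
+
+The same loops against a plain `Dictionary<string, string>` are not safe, which is
+the behaviour this finding describes.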
+
+### DataConnectionLayer-020 — `HandleSubscribeCompleted` double-counts `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance's subscribe
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:653-661,670-688` |
+
+**Description**
+
+`HandleSubscribeCompleted`'s success branch (line 656-661) writes
+`_subscriptionIds[result.TagPath] = result.SubscriptionId!; _totalSubscribed++;
+_resolvedTags++;`. The guard at line 653 only skips when the tag is already in
+`_subscriptionIds`; it does **not** check `_unresolvedTags`. So the success branch
+runs for a tag that previously failed resolution from an earlier instance's subscribe
+(which incremented `_totalSubscribed` and added the tag to `_unresolvedTags` at line
+674-676) and is now successfully subscribed by a later instance.
+
+Sequence:
+
+1. Instance A subscribes "Tag1". `_adapter.SubscribeAsync` throws a non-connection-level
+   exception. `HandleSubscribeCompleted` takes the resolution-failure branch:
+   `_unresolvedTags.Add("Tag1"); _totalSubscribed++;` (now 1).
+2. The device finishes booting. Instance B subscribes "Tag1". `_adapter.SubscribeAsync`
+   succeeds, returning `subId`. `HandleSubscribeCompleted` takes the success branch:
+   `_subscriptionIds["Tag1"] = subId; _totalSubscribed++; _resolvedTags++;`
+   (now `_totalSubscribed = 2`, `_resolvedTags = 1`).
+3. `_unresolvedTags` still contains "Tag1". The retry timer fires next tick,
+   `HandleRetryTagResolution` dispatches `SubscribeAsync("Tag1", ...)` against the
+   adapter (creating a **second** monitored item for the same tag), and
+   `HandleTagResolutionSucceeded` runs `_unresolvedTags.Remove("Tag1")` →
+   `_subscriptionIds["Tag1"] = newSubId` (overwriting Instance B's id, orphaning that
+   monitored item) → `_resolvedTags++` (now 2, matching `_totalSubscribed`).
+
+Net effect:
+
+- `_totalSubscribed` is over-counted by 1 from step 2 until step 3 reconciles
+  `_resolvedTags`. During that window the health report's "subscribed / resolved"
+  ratio is wrong.
+- Two adapter subscription handles for the same tag are leaked across this race
+  (DCL-018's orphan plus the retry's second adapter call); the second leaks
+  permanently because `_subscriptionIds["Tag1"]` only stores the most recent id.
+
+**Recommendation**
+
+In `HandleSubscribeCompleted`'s success branch, before the `_totalSubscribed++`,
+check `_unresolvedTags.Remove(result.TagPath)` — if the tag was already counted as
+unresolved, promote it without re-incrementing `_totalSubscribed` (mirror
+`HandleTagResolutionSucceeded`'s shape: only increment `_resolvedTags`,
+`_subscriptionIds[tag] = subId`, and clear `_resolutionInFlight`). Add a regression
+test that asserts `_totalSubscribed` / `_resolvedTags` consistency after the
+"resolve via a second instance" sequence.
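+
+A minimal sketch of the promote-without-recount shape, under the assumption that
+all handlers run on the actor thread. `SubscriptionCounters` and its method names
+are hypothetical; the fields mirror the counters named in this finding, and the
+`_resolutionInFlight` clean-up is omitted for brevity.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+var counters = new SubscriptionCounters();
+counters.OnResolutionFailed("Tag1");          // instance A: counted, left unresolved
+counters.OnSubscribeSucceeded("Tag1", "s1");  // instance B: promoted, not recounted
+Console.WriteLine($"{counters.TotalSubscribed}/{counters.ResolvedTags}"); // 1/1
+
+// Hypothetical stand-in for the actor's health counters.
+sealed class SubscriptionCounters
+{
+    private readonly Dictionary<string, string> _subscriptionIds = new();
+    private readonly HashSet<string> _unresolvedTags = new();
+
+    public int TotalSubscribed { get; private set; }
+    public int ResolvedTags { get; private set; }
+
+    // Mirrors the resolution-failure branch described in step 1.
+    public void OnResolutionFailed(string tagPath)
+    {
+        _unresolvedTags.Add(tagPath);
+        TotalSubscribed++;
+    }
+
+    // Proposed success branch: a tag already counted as unresolved is promoted
+    // without incrementing TotalSubscribed a second time.
+    public void OnSubscribeSucceeded(string tagPath, string subscriptionId)
+    {
+        if (_subscriptionIds.ContainsKey(tagPath)) return; // existing guard
+        bool wasUnresolved = _unresolvedTags.Remove(tagPath);
+        if (!wasUnresolved) TotalSubscribed++;
+        _subscriptionIds[tagPath] = subscriptionId;
+        ResolvedTags++;
+    }
+}
+```
+
+Removing the tag from `_unresolvedTags` at this point also means the retry timer has
+nothing left to re-subscribe for it, which avoids the second adapter call in step 3.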
+
+### DataConnectionLayer-021 — `HandleSubscribeCompleted` re-creates and leaks `_subscriptionsByInstance` entry when the instance unsubscribed mid-flight
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:626-634,642-687` |
+
+**Description**
+
+`HandleSubscribe` dispatches a `Task.Run` that performs adapter I/O off the actor
+thread and pipes a `SubscribeCompleted` back. If an `UnsubscribeTagsRequest` for the
+same instance is processed on the actor thread between dispatch and completion,
+`HandleUnsubscribe` removes the instance from both `_subscriptionsByInstance` and
+`_subscribers`. When the late `SubscribeCompleted` arrives,
+`HandleSubscribeCompleted` (line 629-634) **re-creates** the
+`_subscriptionsByInstance[instanceName] = new HashSet<string>()` entry and proceeds
+to apply the results — but `_subscribers[instanceName]` was already removed by the
+unsubscribe and is **not** re-added.
+
+Consequences:
+
+1. `_subscriptionsByInstance` keeps a permanently-leaked entry for an instance that
+   no longer exists. `ReSubscribeAll` derives its tag list from
+   `_subscriptionsByInstance.Values` and will keep re-subscribing the leaked tags on
+   every future reconnect.
+2. For each tag, `_tagSubscriberCount[tagPath]` is incremented (line 647-649), so the
+   reverse index treats the leaked instance as a real subscriber. The only way to
+   drop the count is another `HandleUnsubscribe` for the same instance — which can
+   never arrive because the Instance Actor that owned the instance is gone.
+3. The success branch increments `_totalSubscribed` / `_resolvedTags` (or
+   `_unresolvedTags` for genuine resolution failures), drifting health counters
+   permanently above the actual subscribed instance count.
+4. Subsequent `HandleTagValueReceived` fanout iterates `_subscriptionsByInstance` and
+   skips this entry via the `_subscribers.TryGetValue` check (line 1019), so values
+   are silently dropped — but the work of fanning them out (the iteration and the
+   tag lookup) is still done for every value update on every leaked tag, forever.
+5. The genuine-resolution-failure path at line 682-686 (`subscriber.Tell(new
+   TagValueUpdate(..., QualityCode.Bad, ...))`) also silently no-ops because
+   `_subscribers.TryGetValue` is false — so the design doc's "push bad quality on
+   resolution failure" promise is broken for this case (a minor, edge-case wrinkle).
+
+**Recommendation**
+
+In `HandleSubscribeCompleted`, when `_subscriptionsByInstance.TryGetValue` fails,
+treat the result as obsolete: log it and `return` without re-creating the entry or
+applying any state mutations. Any successfully-created adapter subscriptions in
+`msg.Results` should be cleaned up — iterate the results and
+`_adapter.UnsubscribeAsync(result.SubscriptionId!)` (fire-and-forget) for each
+successful one so the orphan handles do not leak in the adapter. Add a regression
+test that subscribes from instance A, immediately sends an `UnsubscribeTagsRequest`
+for A while the subscribe I/O is in flight, completes the subscribe, and asserts
+`_subscriptionsByInstance`, `_tagSubscriberCount` and health counters are all clean.
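+
+The obsolete-result handling could look roughly like this. `TagResult` and
+`LateResultHandler` are hypothetical stand-ins for `SubscribeTagResult` and the
+relevant part of `HandleSubscribeCompleted`; the sketch returns the orphaned
+subscription ids rather than calling the adapter, so the caller would pass them to
+`_adapter.UnsubscribeAsync`.
+
+```csharp
+#nullable enable
+using System;
+using System.Collections.Generic;
+
+// Instance "A" unsubscribed while its subscribe was in flight: no entry exists.
+var byInstance = new Dictionary<string, HashSet<string>>();
+var late = new[]
+{
+    new TagResult("Tag1", true, "sub-1"),
+    new TagResult("Tag2", false, null),
+};
+
+var orphans = LateResultHandler.Apply(byInstance, "A", late);
+Console.WriteLine(byInstance.Count);          // 0: the entry is not re-created
+Console.WriteLine(string.Join(",", orphans)); // sub-1
+
+// Hypothetical result shape for illustration.
+record TagResult(string TagPath, bool Success, string? SubscriptionId);
+
+static class LateResultHandler
+{
+    // Returns the subscription ids that should be unsubscribed by the caller.
+    public static IReadOnlyList<string> Apply(
+        Dictionary<string, HashSet<string>> byInstance,
+        string instanceName,
+        IEnumerable<TagResult> results)
+    {
+        var orphans = new List<string>();
+        if (!byInstance.TryGetValue(instanceName, out var tags))
+        {
+            // Obsolete result: apply no state, only collect handles to clean up.
+            foreach (var r in results)
+                if (r.Success && r.SubscriptionId is not null)
+                    orphans.Add(r.SubscriptionId);
+            return orphans;
+        }
+
+        // Normal path (simplified): record the tags for the live instance.
+        foreach (var r in results) tags.Add(r.TagPath);
+        return orphans;
+    }
+}
+```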
+
+### DataConnectionLayer-022 — `HandleSubscribeCompleted` and `HandleTagResolutionFailed` reset the tag-resolution retry timer on every call via `StartPeriodicTimer`, starving the retry under subscribe bursts
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Akka.NET conventions |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:691-698,991-998` |
+
+**Description**
+
+`HandleSubscribeCompleted` (line 691-698) and `HandleTagResolutionFailed` (line
+991-998) both call:
+
+```csharp
+Timers.StartPeriodicTimer(
+    "tag-resolution-retry",
+    new RetryTagResolution(),
+    _options.TagResolutionRetryInterval,
+    _options.TagResolutionRetryInterval);
+```
+
+`Akka.Actor.ITimerScheduler.StartPeriodicTimer(key, ...)` cancels and replaces any
+existing timer registered under the same key. So every additional subscribe (or
+every additional tag-resolution failure) that produces unresolved tags **resets** the
+retry timer's countdown to the full interval — the timer never accumulates
+elapsed time across calls.
+
+With the default `TagResolutionRetryInterval = 10s`, an instance-startup burst that
+produces a new `SubscribeTagsRequest` with unresolved tags every 5s (a not-unusual
+cadence during deployment fan-out) will keep cancelling the not-yet-fired retry
+every 5s, so the "periodic" retry does not fire until subscribes go quiet. On a site
+where many instances deploy together this can delay tag resolution by tens of
+seconds, leaving attributes at `Bad` quality longer than the documented retry
+interval implies.
+
+**Recommendation**
+
+Start the periodic timer once, when the actor first transitions to having
+non-empty `_unresolvedTags`, and only re-start it after `Timers.Cancel(...)` has
+been called (e.g., when the actor enters `Reconnecting`). The cleanest pattern is to
+gate the start with `if (!Timers.IsTimerActive("tag-resolution-retry"))` before
+calling `StartPeriodicTimer` — `IsTimerActive` is on `ITimerScheduler`. Apply the
+same gate at both call sites. Add a regression test that fires 5 subscribes with
+unresolved tags within one retry interval and asserts the retry fires at most one
+interval after the first failure (not after the fifth subscribe).
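+
+The gate can be illustrated without Akka.NET. `FakeTimers` below is a hypothetical
+stand-in that models only the key bookkeeping of `ITimerScheduler` (the real
+`StartPeriodicTimer` also takes a message and the two intervals); it counts how
+often the timer would be started, and therefore how often its countdown would be
+reset.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+const string RetryKey = "tag-resolution-retry";
+var timers = new FakeTimers();
+
+// Five subscribe completions with unresolved tags inside one retry interval.
+for (int i = 0; i < 5; i++)
+{
+    if (!timers.IsTimerActive(RetryKey))
+        timers.StartPeriodicTimer(RetryKey);
+}
+Console.WriteLine(timers.StartCount); // 1: the countdown is never reset
+
+// Hypothetical fake; not the Akka.NET implementation.
+sealed class FakeTimers
+{
+    private readonly HashSet<string> _active = new();
+
+    public int StartCount { get; private set; }
+
+    public bool IsTimerActive(string key) => _active.Contains(key);
+
+    // Each start models a cancel-and-replace of any timer under the same key.
+    public void StartPeriodicTimer(string key)
+    {
+        _active.Add(key);
+        StartCount++;
+    }
+
+    public void Cancel(string key) => _active.Remove(key);
+}
+```
+
+Without the `IsTimerActive` check the loop would start the timer five times, which
+corresponds to the repeated reset described in this finding.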