code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including
first-time reviews of the four newer components (AuditLog, NotificationOutbox,
SiteCallAudit, Transport) — so the code-reviews/ index reflects today's
codebase rather than the 2026-05-16 baseline. 172 new Open findings (0
Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules.

regen-readme.py now derives each module's Last reviewed + Commit from its
findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future
single-module re-reviews show their own date in the Module Status table.
Joseph Doherty
2026-05-28 02:55:47 -04:00
parent 1eb6e972b0
commit f93b7b99bb
25 changed files with 8793 additions and 115 deletions
+318 -4
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.DataConnectionLayer` |
| Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 5 |
## Summary
@@ -30,6 +30,40 @@ the design doc's failover state machine and the implemented unstable-disconnect
heuristic. Test coverage is adequate for the happy paths and failover but absent for
tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.
#### Re-review 2026-05-28 (commit `1eb6e97`)
The 2026-05-28 re-review walked all 10 checklist categories against the current
source and found **5 new findings**. All 17 prior findings remain `Resolved` and the
fixes (reverse-index unsubscribe, atomic disconnect guards, real-logger threading,
initial-connect failover, per-tag write-batch results, subscribe-response accuracy)
were verified in place. The new findings cluster around `HandleSubscribe` /
`HandleSubscribeCompleted` race-induced state drift:
- **High** — concurrent subscribes for the same tag from different instances each see
the tag as not-yet-subscribed (the `alreadySubscribed` snapshot was taken before
the Task.Run dispatch), so each Task.Run calls `_adapter.SubscribeAsync` and the
later `HandleSubscribeCompleted` silently discards the second adapter subscription
handle — the orphan never gets `UnsubscribeAsync`'d.
- **Medium** — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>`
mutated from thread-pool continuations of `SubscribeAsync` / `UnsubscribeAsync`
calls running in parallel, the same class of bug DCL-003 fixed in
`RealOpcUaClient` but missed in the layer above.
- **Medium** — `HandleSubscribeCompleted`'s success branch never checks
`_unresolvedTags`, so a tag that previously failed resolution (incrementing
`_totalSubscribed`) and is then successfully subscribed by a different instance gets
`_totalSubscribed++` a second time, double-counting; meanwhile the unresolved entry
lingers until the retry timer also resolves it, creating an orphaned monitored item.
- **Medium** — when an instance is unsubscribed mid-flight,
`HandleSubscribeCompleted` re-creates an empty `_subscriptionsByInstance[name]`
entry and processes the late results, leaking `_tagSubscriberCount` /
`_totalSubscribed` / `_resolvedTags` increments for an instance with no
`_subscribers` entry to deliver values to.
- **Medium** — `HandleSubscribeCompleted` calls `Timers.StartPeriodicTimer` on every
completed subscribe with unresolved tags; in Akka.NET, `StartPeriodicTimer` with the
same key cancels and replaces the existing timer, so a burst of subscribes arriving
faster than `TagResolutionRetryInterval` (10 s default) keeps resetting the timer
and the retry never actually fires.
#### Re-review 2026-05-17 (commit `39d737e`)
All 13 findings from the 2026-05-16 review remain `Resolved` and the fixes were
@@ -50,7 +84,22 @@ so a mid-batch disconnect aborts the whole write batch (the same class of defect
DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
`DataConnectionLayer-014`.
## Checklist coverage
## Checklist coverage (2026-05-28 re-review, commit `1eb6e97`)
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | Findings 020 (double-count `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance) and 021 (leaked `_subscriptionsByInstance` entry + counter increments when instance unsubscribes mid-flight). |
| 2 | Akka.NET conventions | x | Finding 022 — `Timers.StartPeriodicTimer` reset on every `HandleSubscribeCompleted` for unresolved tags can stall the retry timer indefinitely under a subscribe burst. |
| 3 | Concurrency & thread safety | x | Finding 018 — concurrent subscribes for the same tag from different instances each spawn an adapter subscription and the second handle is orphaned. Finding 019 — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from thread-pool continuations (same class of bug as DCL-003 one layer above). |
| 4 | Error handling & resilience | x | No new issues; DCL-004 / DCL-007 / DCL-015 / DCL-017 fixes verified in place. |
| 5 | Security | x | No new issues; DCL-012 / DCL-014 fixes verified. The Commons-side `OpcUaEndpointConfig.AutoAcceptUntrustedCerts = true` default surfaced in DCL-012 is still present but is out of this module's scope. |
| 6 | Performance & resource management | x | No new issues; DCL-008 reverse index verified. (Finding 018's orphaned adapter handle is logged under concurrency.) |
| 7 | Design-document adherence | x | No new issues. DCL-009's design-doc action (document unstable-disconnect failover trigger + configurable threshold) is still open at the doc level but out of this module's scope. |
| 8 | Code organization & conventions | x | No issues — POCOs in Commons, options class owned by component, factory + DI registration consistent. |
| 9 | Testing coverage | x | DCL001 through DCL017 regression tests present. Gaps remain for finding 018 (concurrent subscribe of same tag from two instances), 019 (concurrent `_subscriptionHandles` mutation), 020 (resolve-via-different-instance), 021 (unsubscribe-mid-flight), 022 (timer-reset starvation). |
| 10 | Documentation & comments | x | No new issues; DCL-013 atomic-guard XML comments verified. |
## Checklist coverage (2026-05-17 re-review, commit `39d737e`)
| # | Category | Examined | Notes |
|---|----------|----------|-------|
@@ -896,3 +945,268 @@ unhandled exception. Regression test
`DCL017_WriteBatch_ReturnsPerTagResults_WhenConnectionDropsMidBatch` fails against the
pre-fix code (the batch throws, no map returned) and passes after;
`DCL017_WriteBatch_CancellationAbortsWholeBatch` guards that cancellation still aborts.
### DataConnectionLayer-018 — Concurrent subscribes for the same tag from different instances orphan an adapter subscription handle
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:557,564-594,653` |
**Description**
`HandleSubscribe` snapshots `_subscriptionIds.Keys` into a local `alreadySubscribed`
set on the actor thread before dispatching the `Task.Run` that performs the adapter
I/O (line 557). The snapshot is the only basis on which the background task decides
whether to call `_adapter.SubscribeAsync` — and it is taken **once**, before the I/O
runs.
If two `SubscribeTagsRequest` messages arrive on the actor thread for different
instances that both reference the same tag path, both `HandleSubscribe` invocations
take a snapshot at a time when neither subscribe has completed, so `alreadySubscribed`
does not contain the shared tag in either snapshot. Both background tasks then call
`_adapter.SubscribeAsync(tagPath, ...)`, the adapter creates **two** monitored items
and returns two distinct subscription ids, and each task pipes a `SubscribeCompleted`
back to the actor with `AlreadySubscribed: false, Success: true`.
`HandleSubscribeCompleted` for the first message takes the success branch and writes
`_subscriptionIds[tagPath] = subId1`. The second message arrives, hits the
"already in `_subscriptionIds`" guard at line 653 (`_subscriptionIds.ContainsKey(...)`)
and `continue`s — but `result.SubscriptionId` (the orphan handle for the second
adapter subscription) is silently discarded. The orphan monitored item stays alive in
the OPC UA session for the lifetime of the adapter, sending duplicate data-change
notifications (whose callbacks were stamped with the captured `generation`) into
`HandleTagValueReceived` for every value change. Across a deploy that creates many
instances sharing a few tags, this leaks N-1 monitored items per shared tag and
doubles/triples the per-tag publish traffic.
DCL-010 fixed an analogous duplicate-dispatch bug for the tag-resolution retry path
via `_resolutionInFlight`; the equivalent guard is missing on the user-subscribe
path.
**Recommendation**
Track in-flight subscribes the same way DCL-010 tracks in-flight retries: maintain a
`HashSet<string> _subscribesInFlight` and add `tagPath` to it on the actor thread
**before** the `Task.Run` dispatch, only for tags not already in
`_subscriptionIds` and not already in `_subscribesInFlight`. Tags that are already
in flight should produce a `SubscribeTagResult(..., AlreadySubscribed: true, ...)`
without touching the adapter. Remove from `_subscribesInFlight` in
`HandleSubscribeCompleted` once the result is applied. Add a regression test that
fans two simultaneous `SubscribeTagsRequest` messages for the same tag and asserts
exactly one `_adapter.SubscribeAsync(tag, ...)` call (and no orphan subscription id).
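The in-flight guard recommended above could look roughly like the following. This is a minimal sketch, not the actor's real code: `SubscribeGate`, `ClaimTagsToSubscribe` and `Complete` are hypothetical names, and the class only models the two collections (`_subscriptionIds`, `_subscribesInFlight`) that would live on the actor thread.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the actor's subscribe bookkeeping (not the real DataConnectionActor).
// Every method is assumed to run on the actor thread, so no locking is needed.
public sealed class SubscribeGate
{
    private readonly Dictionary<string, string> _subscriptionIds = new Dictionary<string, string>();
    private readonly HashSet<string> _subscribesInFlight = new HashSet<string>();

    // Called before the adapter I/O is dispatched. Returns only the tags that still
    // need an adapter subscribe and marks each of them as in flight.
    public List<string> ClaimTagsToSubscribe(IEnumerable<string> requestedTags)
    {
        var toSubscribe = new List<string>();
        foreach (var tagPath in requestedTags)
        {
            if (_subscriptionIds.ContainsKey(tagPath))
                continue; // already subscribed

            // HashSet.Add returns false when another request already claimed this tag.
            if (_subscribesInFlight.Add(tagPath))
                toSubscribe.Add(tagPath);
        }
        return toSubscribe;
    }

    // Called when the adapter result is piped back to the actor.
    public void Complete(string tagPath, bool success, string subscriptionId)
    {
        _subscribesInFlight.Remove(tagPath);
        if (success)
            _subscriptionIds[tagPath] = subscriptionId;
    }

    public bool IsSubscribed(string tagPath) => _subscriptionIds.ContainsKey(tagPath);
}
```

Because the claim happens before dispatch, a second request for the same tag never reaches the adapter, so there is no second subscription id to orphan. A failed subscribe clears the in-flight mark, which leaves a later request free to try again.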
### DataConnectionLayer-019 — `OpcUaDataConnection._subscriptionHandles` is a plain `Dictionary<,>` mutated from concurrent thread-pool continuations
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:31,163-164,167,177` |
**Description**
`OpcUaDataConnection._subscriptionHandles` is declared as `Dictionary<string,
string>`. It is mutated from:
- `SubscribeAsync` (line 167): `_subscriptionHandles[subscriptionId] = tagPath;`
after an `await _client!.CreateSubscriptionAsync(...)` — i.e. the assignment
executes on the continuation thread (a thread-pool thread).
- `UnsubscribeAsync` (line 177): `_subscriptionHandles.Remove(subscriptionId);`
similarly after an `await`.
- `DisconnectAsync`, which goes through the underlying `_client.DisconnectAsync`,
does **not** touch `_subscriptionHandles`; the exposure comes from multiple
`SubscribeAsync` / `UnsubscribeAsync` calls running in parallel from the upper layer.
The DCL upper layer calls `_adapter.SubscribeAsync` from multiple places that all
run off the actor thread:
- `DataConnectionActor.HandleSubscribe` inside its `Task.Run` (multiple invocations
can run in parallel — see DCL-018);
- `HandleRetryTagResolution` issues `_adapter.SubscribeAsync` for every tag in
`_unresolvedTags` and pipes the continuation (each subscribe runs concurrently
via the SDK's async machinery);
- `ReSubscribeAll` does the same after a reconnect.
So plain-`Dictionary` mutations occur on multiple thread-pool threads concurrently —
the exact pattern DCL-003 fixed by switching `RealOpcUaClient._monitoredItems` and
`_callbacks` to `ConcurrentDictionary<,>`. Plain `Dictionary` mutations during a
concurrent resize are undefined behaviour: they can throw
`InvalidOperationException`, corrupt the internal hash buckets, or lose entries.
`_subscriptionHandles` is currently dead state (the dictionary is written to
and `Remove`d but **never read**), so a corruption today would not crash the
subscribe path — but the bug is latent and the field will become load-bearing the
moment any code reads it (e.g., to expose a subscription-id-to-tag-path lookup for
diagnostics, which is what the dictionary's name suggests it was intended for).
**Recommendation**
Either (a) change `_subscriptionHandles` to
`ConcurrentDictionary<string, string>` and use `TryAdd` / `TryRemove`, mirroring
DCL-003's fix one layer down, or (b) delete the field entirely since it is never
read — the bookkeeping is fully owned by `RealOpcUaClient._monitoredItems` /
`_callbacks` and `DataConnectionActor._subscriptionIds`. Removing it eliminates the
race and removes dead state in one stroke. Add a regression test (or extend
`DCL003_SharedDictionaryFields_AreConcurrentCollections`) that asserts no
non-concurrent `Dictionary` field is shared across thread boundaries in adapter
state.
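If option (a) is chosen, the change could be as small as the sketch below. `SubscriptionHandleMap`, `Track` and `Untrack` are hypothetical names used only for illustration; the point is the swap to `ConcurrentDictionary<string, string>` with `TryAdd` / `TryRemove`, both of which are safe to call from parallel continuations.

```csharp
using System.Collections.Concurrent;

// Hypothetical wrapper illustrating option (a); the real field lives in OpcUaDataConnection.
public sealed class SubscriptionHandleMap
{
    private readonly ConcurrentDictionary<string, string> _subscriptionHandles =
        new ConcurrentDictionary<string, string>();

    // Safe to call from any thread-pool continuation; returns false on a duplicate id.
    public bool Track(string subscriptionId, string tagPath) =>
        _subscriptionHandles.TryAdd(subscriptionId, tagPath);

    // Returns false when the id was not tracked (for example, already removed).
    public bool Untrack(string subscriptionId) =>
        _subscriptionHandles.TryRemove(subscriptionId, out _);

    public int Count => _subscriptionHandles.Count;
}
```

Option (b), deleting the field, needs no code at all and may still be the simpler fix while nothing reads the dictionary.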
### DataConnectionLayer-020 — `HandleSubscribeCompleted` double-counts `_totalSubscribed` when a previously-unresolved tag is resolved by a different instance's subscribe
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:653-661,670-688` |
**Description**
`HandleSubscribeCompleted`'s success branch (line 656-661) writes
`_subscriptionIds[result.TagPath] = result.SubscriptionId!; _totalSubscribed++;
_resolvedTags++;`. The guard at line 653 only skips when the tag is already in
`_subscriptionIds`; it does **not** check `_unresolvedTags`. So the success branch
runs for a tag that previously failed resolution from an earlier instance's subscribe
(which incremented `_totalSubscribed` and added the tag to `_unresolvedTags` at line
674-676) and is now successfully subscribed by a later instance.
Sequence:
1. Instance A subscribes "Tag1". `_adapter.SubscribeAsync` throws a non-connection-level
exception. `HandleSubscribeCompleted` takes the resolution-failure branch:
`_unresolvedTags.Add("Tag1"); _totalSubscribed++;` (now 1).
2. The device finishes booting. Instance B subscribes "Tag1". `_adapter.SubscribeAsync`
succeeds, returning `subId`. `HandleSubscribeCompleted` takes the success branch:
`_subscriptionIds["Tag1"] = subId; _totalSubscribed++; _resolvedTags++;`
(now `_totalSubscribed = 2`, `_resolvedTags = 1`).
3. `_unresolvedTags` still contains "Tag1". The retry timer fires next tick,
`HandleRetryTagResolution` dispatches `SubscribeAsync("Tag1", ...)` against the
adapter (creating a **second** monitored item for the same tag), and
`HandleTagResolutionSucceeded` runs `_unresolvedTags.Remove("Tag1")` →
`_subscriptionIds["Tag1"] = newSubId` (overwriting Instance B's id, orphaning that
monitored item) → `_resolvedTags++` (now 2, matching `_totalSubscribed`).
Net effect:
- `_totalSubscribed` is over-counted by 1 from step 2 until step 3 reconciles
`_resolvedTags`. During that window the health report's "subscribed / resolved"
ratio is wrong.
- Two adapter subscription handles for the same tag are leaked across this race
(DCL-018's orphan plus the retry's second adapter call); the second leaks
permanently because `_subscriptionIds["Tag1"]` only stores the most recent id.
**Recommendation**
In `HandleSubscribeCompleted`'s success branch, before the `_totalSubscribed++`,
check `_unresolvedTags.Remove(result.TagPath)` — if the tag was already counted as
unresolved, promote it without re-incrementing `_totalSubscribed` (mirror
`HandleTagResolutionSucceeded`'s shape: only increment `_resolvedTags`,
`_subscriptionIds[tag] = subId`, and clear `_resolutionInFlight`). Add a regression
test that asserts `_totalSubscribed` / `_resolvedTags` consistency after the
"resolve via a second instance" sequence.
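The promotion step described in the recommendation can be sketched as follows. `SubscriptionCounters` and its method names are hypothetical, and the sketch models only the counters and the two collections involved. It also assumes a tag is counted as unresolved once, which is a simplification that the finding does not confirm.

```csharp
using System.Collections.Generic;

// Hypothetical model of the counters touched by HandleSubscribeCompleted (actor thread only).
public sealed class SubscriptionCounters
{
    private readonly Dictionary<string, string> _subscriptionIds = new Dictionary<string, string>();
    private readonly HashSet<string> _unresolvedTags = new HashSet<string>();

    public int TotalSubscribed { get; private set; }
    public int ResolvedTags { get; private set; }

    // Resolution-failure branch: the tag is counted once and parked for retry.
    // Counting only on first insertion is an assumption of this sketch.
    public void OnResolutionFailed(string tagPath)
    {
        if (_unresolvedTags.Add(tagPath))
            TotalSubscribed++;
    }

    // Success branch: promote a previously unresolved tag without counting it again.
    public void OnSubscribeSucceeded(string tagPath, string subscriptionId)
    {
        if (_subscriptionIds.ContainsKey(tagPath))
            return; // already subscribed

        // HashSet.Remove returns true when the tag was already counted as unresolved.
        bool wasUnresolved = _unresolvedTags.Remove(tagPath);
        _subscriptionIds[tagPath] = subscriptionId;
        if (!wasUnresolved)
            TotalSubscribed++;
        ResolvedTags++;
    }

    public bool IsUnresolved(string tagPath) => _unresolvedTags.Contains(tagPath);
}
```

Removing the tag from `_unresolvedTags` in the same step also keeps the retry timer from issuing a second adapter subscribe for a tag that is already live.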
### DataConnectionLayer-021 — `HandleSubscribeCompleted` re-creates and leaks `_subscriptionsByInstance` entry when the instance unsubscribed mid-flight
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:626-634,642-687` |
**Description**
`HandleSubscribe` dispatches a `Task.Run` that performs adapter I/O off the actor
thread and pipes a `SubscribeCompleted` back. If an `UnsubscribeTagsRequest` for the
same instance is processed on the actor thread between dispatch and completion,
`HandleUnsubscribe` removes the instance from both `_subscriptionsByInstance` and
`_subscribers`. When the late `SubscribeCompleted` arrives,
`HandleSubscribeCompleted` (line 629-634) **re-creates** the
`_subscriptionsByInstance[instanceName] = new HashSet<string>()` entry and proceeds
to apply the results — but `_subscribers[instanceName]` was already removed by the
unsubscribe and is **not** re-added.
Consequences:
1. `_subscriptionsByInstance` keeps a permanently-leaked entry for an instance that
no longer exists. `ReSubscribeAll` derives its tag list from
`_subscriptionsByInstance.Values` and will keep re-subscribing the leaked tags on
every future reconnect.
2. For each tag, `_tagSubscriberCount[tagPath]` is incremented (line 647-649), so the
reverse index treats the leaked instance as a real subscriber. The only way to
drop the count is another `HandleUnsubscribe` for the same instance — which can
never arrive because the Instance Actor that owned the instance is gone.
3. The success branch increments `_totalSubscribed` / `_resolvedTags` (or
`_unresolvedTags` for genuine resolution failures), drifting health counters
permanently above the actual subscribed instance count.
4. Subsequent `HandleTagValueReceived` fanout iterates `_subscriptionsByInstance` and
skips this entry via the `_subscribers.TryGetValue` check (line 1019), so values
are silently dropped — but the work of fanning them out (the iteration and the
tag lookup) is still done for every value update on every leaked tag, forever.
5. The genuine-resolution-failure path at line 682-686 (`subscriber.Tell(new
TagValueUpdate(..., QualityCode.Bad, ...))`) also silently no-ops because
`_subscribers.TryGetValue` is false — so the design doc's "push bad quality on
resolution failure" promise is broken for this case (a minor, edge-case wrinkle).
**Recommendation**
In `HandleSubscribeCompleted`, when `_subscriptionsByInstance.TryGetValue` fails,
treat the result as obsolete: log it and `return` without re-creating the entry or
applying any state mutations. Any successfully-created adapter subscriptions in
`msg.Results` should be cleaned up — iterate the results and
`_adapter.UnsubscribeAsync(result.SubscriptionId!)` (fire-and-forget) for each
successful one so the orphan handles do not leak in the adapter. Add a regression
test that subscribes from instance A, immediately sends an `UnsubscribeTagsRequest`
for A while the subscribe I/O is in flight, completes the subscribe, and asserts
`_subscriptionsByInstance`, `_tagSubscriberCount` and health counters are all clean.
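A rough shape for the "treat the late result as obsolete" handling is sketched below. The types `SubscribeResult` and `InstanceSubscriptions` are hypothetical simplifications, and the sketch assumes the instance entry is created on the actor thread before the subscribe I/O is dispatched, so a missing entry at completion time can only mean the instance unsubscribed in the meantime.

```csharp
using System.Collections.Generic;

// Hypothetical, simplified result type; the real message carries more fields.
public sealed class SubscribeResult
{
    public SubscribeResult(string tagPath, bool success, string subscriptionId)
    {
        TagPath = tagPath;
        Success = success;
        SubscriptionId = subscriptionId;
    }

    public string TagPath { get; }
    public bool Success { get; }
    public string SubscriptionId { get; }
}

// Hypothetical model of the per-instance bookkeeping (actor thread only).
public sealed class InstanceSubscriptions
{
    private readonly Dictionary<string, HashSet<string>> _subscriptionsByInstance =
        new Dictionary<string, HashSet<string>>();

    // Assumed to run in HandleSubscribe, before the adapter I/O is dispatched.
    public void Register(string instanceName)
    {
        if (!_subscriptionsByInstance.ContainsKey(instanceName))
            _subscriptionsByInstance[instanceName] = new HashSet<string>();
    }

    public void Unsubscribe(string instanceName) => _subscriptionsByInstance.Remove(instanceName);

    // Applies a completed subscribe. When the instance is gone, no state is mutated and
    // the ids of the subscriptions that were created are returned so the caller can
    // release them in the adapter.
    public List<string> ApplyCompleted(string instanceName, IEnumerable<SubscribeResult> results)
    {
        var orphans = new List<string>();
        if (!_subscriptionsByInstance.TryGetValue(instanceName, out var tags))
        {
            foreach (var result in results)
                if (result.Success)
                    orphans.Add(result.SubscriptionId);
            return orphans; // obsolete result: do not re-create the entry
        }

        foreach (var result in results)
            tags.Add(result.TagPath);
        return orphans;
    }

    public bool HasInstance(string instanceName) => _subscriptionsByInstance.ContainsKey(instanceName);
}
```

The caller would pass each returned id to `_adapter.UnsubscribeAsync` (fire-and-forget), as the recommendation describes.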
### DataConnectionLayer-022 — `HandleSubscribeCompleted` and `HandleTagResolutionFailed` reset the tag-resolution retry timer on every call via `StartPeriodicTimer`, starving the retry under subscribe bursts
| | |
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:691-698,991-998` |
**Description**
`HandleSubscribeCompleted` (line 691-698) and `HandleTagResolutionFailed` (line
991-998) both call:
```
Timers.StartPeriodicTimer(
    "tag-resolution-retry",
    new RetryTagResolution(),
    _options.TagResolutionRetryInterval,
    _options.TagResolutionRetryInterval);
```
`Akka.Actor.ITimerScheduler.StartPeriodicTimer(key, ...)` cancels and replaces any
existing timer registered under the same key. So every additional subscribe (or
every additional tag-resolution failure) that produces unresolved tags **resets** the
retry timer's countdown to the full interval — the timer never accumulates
elapsed time across calls.
With the default `TagResolutionRetryInterval = 10s`, an instance-startup burst that
produces a new `SubscribeTagsRequest` every 5s (a not-unusual cadence during
deployment fan-out) will keep cancelling the not-yet-fired retry every 5s, so the
"periodic" retry never actually fires until subscribes go quiet. On a site where
many instances deploy together this can delay tag resolution by tens
of seconds, leaving attributes at `Bad` quality longer than the documented retry
interval implies.
**Recommendation**
Start the periodic timer once, when the actor first transitions to having
non-empty `_unresolvedTags`, and only re-start it after `Timers.Cancel(...)` has
been called (e.g., when the actor enters `Reconnecting`). The cleanest pattern is to
gate the start with `if (!Timers.IsTimerActive("tag-resolution-retry"))` before
calling `StartPeriodicTimer` — `IsTimerActive` is on `ITimerScheduler`. Apply the
same gate at both call sites. Add a regression test that fires 5 subscribes with
unresolved tags within one retry interval and asserts the retry fires at most one
interval after the first failure (not after the fifth subscribe).
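The gate can be illustrated without Akka.NET by a small stand-in. `IRetryTimers` below is a hypothetical interface that copies only the two `ITimerScheduler` members named above (`IsTimerActive` and the four-argument `StartPeriodicTimer`); `CountingTimers` and `RetryTimerGate` are likewise hypothetical and exist only to show that repeated calls start the timer once.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the two ITimerScheduler members this sketch needs.
public interface IRetryTimers
{
    bool IsTimerActive(object key);
    void StartPeriodicTimer(object key, object msg, TimeSpan initialDelay, TimeSpan interval);
}

// Test double that records how often a timer is (re)started.
public sealed class CountingTimers : IRetryTimers
{
    private readonly HashSet<object> _active = new HashSet<object>();

    public int StartCount { get; private set; }

    public bool IsTimerActive(object key) => _active.Contains(key);

    public void StartPeriodicTimer(object key, object msg, TimeSpan initialDelay, TimeSpan interval)
    {
        // Each call stands for a cancel-and-replace of the timer under this key.
        _active.Add(key);
        StartCount++;
    }
}

public static class RetryTimerGate
{
    public const string Key = "tag-resolution-retry";

    // Starts the retry timer only when none is running, so later calls cannot reset it.
    public static void EnsureStarted(IRetryTimers timers, object retryMessage, TimeSpan interval)
    {
        if (timers.IsTimerActive(Key))
            return;
        timers.StartPeriodicTimer(Key, retryMessage, interval, interval);
    }
}
```

In the actor, the same `if (!Timers.IsTimerActive(...))` check would sit directly in front of the existing `Timers.StartPeriodicTimer` call at both call sites.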