docs: add code review process and baseline review of all 19 modules

Establishes a per-module code review workflow under code-reviews/ and records the 2026-05-16 baseline review (commit 9c60592): 241 findings across all src/ modules (6 Critical, 46 High, 100 Medium, 89 Low). This is the clean starting point for remediation work.
2026-05-16 18:09:09 -04:00
parent 9c60592632
commit 977d7369a7
23 changed files with 8899 additions and 0 deletions
--- a/code-reviews/DataConnectionLayer/findings.md
+++ b/code-reviews/DataConnectionLayer/findings.md
@@ -0,0 +1,471 @@
+# Code Review — DataConnectionLayer
+
+| Field | Value |
+|-------|-------|
+| Module | `src/ScadaLink.DataConnectionLayer` |
+| Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
+| Status | Reviewed |
+| Last reviewed | 2026-05-16 |
+| Reviewer | claude-agent |
+| Commit reviewed | `9c60592` |
+| Open findings | 13 |
+
+## Summary
+
+The DataConnectionLayer is a reasonably well-structured module: the Become/Stash
+lifecycle state machine, the captured-`Self` marshalling of background-thread
+disconnect events, and the protocol-factory abstraction all follow the design doc
+and Akka.NET conventions. However, the review found one **critical** actor-model
+violation — `HandleSubscribe` spawns a `Task.Run` that mutates the actor's private
+dictionaries and counters from a thread-pool thread, racing with the actor's own
+message loop. Several **high**-severity issues cluster around concurrency and error
+handling: the subscription-failure path leaves the connection with degraded subtrees
+but no real recovery, the `DataConnectionManagerActor`'s `Restart` supervision drops
+all subscription state on a connection-actor crash, and `RealOpcUaClient`'s monitored-
+item callback dictionary is mutated without synchronization while OPC UA notification
+threads read it. The remaining findings concern stale health counters after failover,
+an unused `WriteTimeout` option (writes are unbounded despite the design promising a
+30 s timeout), `ReadBatchAsync` aborting mid-batch, and documentation drift between
+the design doc's failover state machine and the implemented unstable-disconnect
+heuristic. Test coverage is adequate for the happy paths and failover but absent for
+tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.
+
+## Checklist coverage
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | x | `_resolvedTags` double-counting and stale counters after failover; `ReadBatchAsync` aborts mid-batch. |
+| 2 | Akka.NET conventions | x | `Task.Run` mutating actor state (critical); `Restart` supervision loses state; closures capturing `_subscriptionsByInstance`. |
+| 3 | Concurrency & thread safety | x | Actor state mutated off the actor thread; `RealOpcUaClient` callback dictionary unsynchronized. |
+| 4 | Error handling & resilience | x | Subscription failures not surfaced; unbounded write with no timeout; reconnect after subscribe-time failure not handled. |
+| 5 | Security | x | `AutoAcceptUntrustedCerts` defaults to `true`; OPC UA password handling acceptable. See finding 012. |
+| 6 | Performance & resource management | x | `HandleUnsubscribe` O(n^2) over instances; initial-read loop serial per tag. |
+| 7 | Design-document adherence | x | Failover heuristic (unstable-disconnect count) differs from documented state machine; `WriteTimeout` documented but unused. |
+| 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. |
+| 9 | Testing coverage | x | No tests for tag-resolution retry, disconnect/re-subscribe, bad-quality push, or `HandleSubscribe` concurrency. |
+| 10 | Documentation & comments | x | XML comment on `RaiseDisconnected` claims thread safety it does not have; design doc round-robin description stale. |
+
+## Findings
+
+### DataConnectionLayer-001 — `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread
+
+| | |
+|--|--|
+| Severity | Critical |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:473-538` |
+
+**Description**
+
+`HandleSubscribe` launches a `Task.Run(async () => ...)` that runs on a thread-pool
+thread and directly mutates the actor's private mutable state: `instanceTags` (a
+reference into `_subscriptionsByInstance`), `_subscriptionIds`, `_totalSubscribed`,
+`_resolvedTags`, and `_unresolvedTags`. All of these are simultaneously read and
+written by the actor's own message loop (`HandleTagValueReceived`, `HandleUnsubscribe`,
+`ReSubscribeAll`, `HandleRetryTagResolution`, `ReplyWithHealthReport`). This is a
+direct violation of the Akka.NET actor model, which guarantees single-threaded access
+to actor state only when state is touched on the actor thread. Two concurrent
+subscribe requests, or a subscribe overlapping a `TagValueReceived` / `GetHealthReport`,
+produce data races on `Dictionary`/`HashSet`/`int` — `Dictionary` is not thread-safe
+and concurrent mutation can corrupt internal buckets, throw, or lose entries. It can
+also produce torn reads of the health counters.
+
+**Recommendation**
+
+Do not mutate actor state from the background task. Perform only the `await
+_adapter.SubscribeAsync(...)` / `ReadAsync(...)` I/O in the task, collect the results
+into a local immutable result object, and `PipeTo(Self)` an internal message (e.g.
+`SubscribeCompleted`) whose handler — running on the actor thread — applies all state
+mutations and counter updates. The response to `Sender` should be sent from that
+handler too.
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-002 — `Restart` supervision discards all subscription state on connection-actor crash
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Akka.NET conventions |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionManagerActor.cs:131-141` |
+
+**Description**
+
+`DataConnectionManagerActor.SupervisorStrategy` returns a `OneForOneStrategy` with
+`Directive.Restart` for `DataConnectionActor` failures. On restart, Akka.NET creates a
+fresh actor instance, so all in-memory fields — `_subscriptionsByInstance`,
+`_subscriptionIds`, `_subscribers`, `_unresolvedTags`, the quality counters — are
+silently discarded. The actor re-enters `Connecting` with zero subscriptions, and the
+design doc's "transparent re-subscribe" guarantee (WP-10) is broken: Instance Actors
+that had subscribed before the crash never get their tags re-subscribed and will sit
+at uncertain/stale quality indefinitely with no error returned. There is no durable
+subscription store from which a restarted actor could rebuild state.
+
+**Recommendation**
+
+Either (a) make the subscription registry durable/recoverable so a restarted actor
+can rebuild it (persist to local SQLite as the design doc says connection definitions
+are, and have `PreStart` reload subscriptions), or (b) treat a connection-actor crash
+as a lifecycle event the `DataConnectionManagerActor` notices, so it can re-issue the
+subscription registrations. At minimum document that subscribers must re-register
+after a crash and surface the lost-state condition rather than failing silently.
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-003 — `RealOpcUaClient` callback/monitored-item dictionaries mutated without synchronization
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:16-17,130-131,153,163,173,183-184` |
+
+**Description**
+
+`_monitoredItems` and `_callbacks` are plain `Dictionary<,>` instances. They are
+written from `CreateSubscriptionAsync` / `RemoveSubscriptionAsync` (invoked from the
+`DataConnectionActor`'s `Task.Run` / `ContinueWith` continuations, i.e. thread-pool
+threads) and from `DisconnectAsync` (`.Clear()`), while being read concurrently from
+the OPC Foundation SDK's `MonitoredItem.Notification` event handler, which fires on
+the SDK's internal publish threads (`_callbacks.TryGetValue(handle, ...)` at line
+163). Concurrent reads during a `Dictionary` resize or `Clear()` are undefined
+behaviour — they can throw `InvalidOperationException`, return wrong entries, or
+corrupt the dictionary. The `DataConnectionActor`'s subscribe path already runs off
+the actor thread (finding 001), so multiple subscribe calls can also race each other
+here.
+
+**Recommendation**
+
+Use `ConcurrentDictionary<,>` for `_monitoredItems` and `_callbacks`, or guard all
+access with a lock. Note that fixing finding 001 (serialising subscribe through the
+actor thread) reduces but does not eliminate the race, because the SDK notification
+threads still read `_callbacks` concurrently with `RemoveSubscriptionAsync` /
+`DisconnectAsync`.
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-004 — Subscribe-time tag-resolution failure leaves the connection healthy but never recovers correctly
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:495-503,529-537` |
+
+**Description**
+
+When `_adapter.SubscribeAsync` throws inside the `HandleSubscribe` background task,
+the catch block adds the tag to `_unresolvedTags` and increments `_totalSubscribed`,
+treating every subscribe exception as a tag-resolution failure. But `SubscribeAsync`
+also throws `InvalidOperationException` from `EnsureConnected()` when the OPC UA
+client is not connected, and throws on transport faults — these are connection
+problems, not bad tag paths. They get misclassified as unresolved tags and retried on
+the 10 s tag-resolution timer instead of triggering the reconnection state machine.
+Worse, the design doc (Tag Path Resolution, step 2) says the failed tag's attribute
+must be marked quality `bad`; the code never pushes a bad-quality update to the
+subscriber for a tag that fails to resolve at subscribe time, so the Instance Actor
+stays at uncertain quality with no signal. The `TagResolutionFailed` message it sends
+to `Self` only logs and re-arms the timer (`HandleTagResolutionFailed`).
+
+**Recommendation**
+
+Distinguish connection-level exceptions (raise `AdapterDisconnected` / let the
+reconnect machine handle them) from genuine node-not-found errors. For genuine
+resolution failures, push a `TagValueUpdate` with `QualityCode.Bad` to the subscribing
+Instance Actor so it reflects the documented behaviour.
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-005 — `WriteTimeout` option is documented and configured but never applied
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/DataConnectionOptions.cs:15`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:573-590` |
+
+**Description**
+
+`DataConnectionOptions.WriteTimeout` (default 30 s) and the design doc's "Shared
+Settings" table both promise a bounded timeout for synchronous device writes. The
+value is never read anywhere in the module (`grep` confirms only the declaration).
+`HandleWrite` calls `_adapter.WriteAsync(request.TagPath, request.Value)` with no
+`CancellationToken` and no timeout. If the OPC UA server hangs (TCP black-hole, no
+RST), the write `Task` never completes, `PipeTo(sender)` never fires, and the calling
+script's Ask blocks until its own ask-timeout — and the script gets no DCL-level
+error. The design states write failures (including timeout) must be returned
+synchronously to the script; an unbounded write violates that.
+
+**Recommendation**
+
+Create a `CancellationTokenSource(_options.WriteTimeout)`, pass its token to
+`WriteAsync`, and in the continuation translate cancellation into a failed
+`WriteTagResponse` with a timeout error message. Apply the same to the read used by
+the initial-value seed and to `WriteBatchAndWaitAsync` paths if they are reachable.
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-006 — Health quality counters not reset/recomputed after failover or re-subscribe
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:645-673,721-756` |
+
+**Description**
+
+`ReSubscribeAll` resets `_subscriptionIds`, `_unresolvedTags` and `_resolvedTags` to a
+clean slate, but leaves `_lastTagQuality`, `_tagsGoodQuality`, `_tagsBadQuality` and
+`_tagsUncertainQuality` untouched. `PushBadQualityForAllTags` (called on disconnect)
+sets `_tagsBadQuality = _lastTagQuality.Count` and zeroes the others. After a
+reconnect, `HandleTagValueReceived` decrements the *old* bucket using
+`_lastTagQuality`'s value and increments the new one — but tags resolved for the first
+time after reconnect were never in `_lastTagQuality`, so they only increment, never
+decrement, and the totals can drift above `_totalSubscribed`. Over repeated
+disconnect/reconnect cycles the health report's good/bad/uncertain counts become
+unreliable.
+
+**Recommendation**
+
+On `BecomeConnected` after a re-subscribe (or in `ReSubscribeAll`), clear
+`_lastTagQuality` and the three quality counters and let them be repopulated from
+fresh `TagValueReceived` messages. Alternatively recompute the buckets from
+`_lastTagQuality` whenever it changes rather than maintaining incremental counters.
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-007 — `ReadBatchAsync` aborts the whole batch on the first failing tag
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:187-195` |
+
+**Description**
+
+`ReadBatchAsync` loops calling `ReadAsync` per tag. `ReadAsync` re-throws any
+non-cancellation exception (line 184). So if any single tag in the batch throws (bad
+node, transient fault), the entire `ReadBatchAsync` throws and the caller gets no
+results for the tags that *did* read successfully — even though `ReadResult` already
+has a `Success`/`ErrorMessage` shape designed to carry per-tag failures. The batch is
+also fully serial (one round-trip per tag), defeating the point of a batch API; the
+design doc lists `ReadBatch`/`WriteBatch` as first-class operations.
+
+**Recommendation**
+
+Catch per-tag exceptions inside the loop and store a failed `ReadResult` for that tag
+so the batch returns a complete map. Ideally issue a single OPC UA `Read` service call
+for all node IDs (`RealOpcUaClient.ReadValueAsync` already builds a
+`ReadValueIdCollection` — extend it to accept multiple nodes).
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-008 — `HandleUnsubscribe` is O(n^2) over instances and rechecks `_unresolvedTags` redundantly
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:540-569` |
+
+**Description**
+
+For each tag of the instance being removed, `HandleUnsubscribe` scans every other
+instance's tag set (`_subscriptionsByInstance.Where(...).Any()`), making the operation
+O(tags x instances). On a site with many instances sharing a connection this is
+needlessly expensive on every instance stop/redeploy. Separately, line 562
+re-evaluates `!_unresolvedTags.Contains(tagPath)` immediately after line 561 already
+removed `tagPath` from `_unresolvedTags`, so the condition is always true — dead
+logic that obscures intent (the decrement of `_resolvedTags` is unconditional in
+practice).
+
+**Recommendation**
+
+Maintain a reference count per tag path (or a `tagPath -> set<instance>` reverse index)
+so the "any other subscriber" check is O(1). Remove the redundant `_unresolvedTags`
+re-check or restructure so the resolved/unresolved decrement reflects the tag's actual
+prior state captured before removal.
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-009 — Implemented failover heuristic diverges from the documented state machine
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:189,242-297,379-449`, `docs/requirements/Component-DataConnectionLayer.md:73-85` |
+
+**Description**
+
+The design doc's failover state machine reads "retry active endpoint (5s) -> N failures
+(>= FailoverRetryCount) -> switch to other endpoint". The code implements two *separate*
+failover triggers: (a) `HandleReconnectResult` counts `_consecutiveFailures` on
+connect-attempt failures (matches the doc), and (b) `BecomeReconnecting` additionally
+counts `_consecutiveUnstableDisconnects` — connections that succeeded but dropped
+within a hard-coded 60 s `StableConnectionThreshold` — and fails over on that count
+too. The unstable-disconnect path, the 60 s threshold, and the fact that failover can
+happen on *successful-but-flaky* connections are not described in the component doc at
+all. A reviewer or operator reading `Component-DataConnectionLayer.md` would not
+predict this behaviour, and the 60 s threshold is a magic constant not exposed via
+`DataConnectionOptions`.
+
+**Recommendation**
+
+Update `Component-DataConnectionLayer.md` to document the unstable-disconnect failover
+path and the stability threshold, and move the 60 s threshold into
+`DataConnectionOptions` so it is configurable and consistent with the other tunables.
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-010 — Tag-resolution retry can issue duplicate concurrent subscribe attempts
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:594-619,689-703` |
+
+**Description**
+
+`HandleRetryTagResolution` fires `SubscribeAsync` for every tag in `_unresolvedTags`
+via `ContinueWith(...).PipeTo(self)`, but does **not** remove the tags from
+`_unresolvedTags` while the attempts are in flight. Because tags are not removed
+before the retry, a slow `SubscribeAsync` overlapping the next 10 s tick issues
+duplicate concurrent subscribe attempts for the same tag, which can create duplicate
+monitored items / leaked subscription IDs (the second success overwrites
+`_subscriptionIds[tag]` in `HandleTagResolutionSucceeded`, orphaning the first handle
+with no `UnsubscribeAsync` call). The timer-cancel condition in
+`HandleTagResolutionSucceeded` is also non-deterministic for the same reason.
+
+**Recommendation**
+
+Remove tags from `_unresolvedTags` (into an "in-flight" set) when a retry is
+dispatched, and only put them back on failure. This prevents overlapping duplicate
+subscribe attempts and makes the timer-cancel condition deterministic.
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-011 — Stale subscription callbacks from disposed adapters can still reach the actor
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:486-489,278-285,416-425`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:252-262` |
+
+**Description**
+
+On failover the actor disposes the old adapter (`_adapter.DisposeAsync()`,
+fire-and-forget) and creates a fresh one. The old adapter's subscription callbacks
+captured `self` and `tagPath` and `Tell` `TagValueReceived` to the actor. While the
+`Reconnecting` handler ignores `TagValueReceived` (line 334), once the actor reaches
+`Connected` again it processes them — and a disposed adapter whose OPC UA SDK threads
+have not yet fully torn down could still deliver a value, mixing pre-failover device
+data with the new endpoint's data and briefly reporting a value the active endpoint
+never produced. There is no per-adapter generation/epoch tag on `TagValueReceived` to
+distinguish current from stale callbacks.
+
+**Recommendation**
+
+Add an adapter-generation counter incremented on every adapter swap; stamp it onto
+`TagValueReceived` (captured in the callback closure) and drop messages whose
+generation does not match the current adapter in `HandleTagValueReceived`.
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-012 — `AutoAcceptUntrustedCerts` defaults to `true`, accepting any server certificate
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Adapters/IOpcUaClient.cs:17`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:49,60-61`, `docs/requirements/Component-DataConnectionLayer.md:116` |
+
+**Description**
+
+`OpcUaConnectionOptions.AutoAcceptUntrustedCerts` defaults to `true`, and
+`RealOpcUaClient.ConnectAsync` wires `CertificateValidator.CertificateValidation += (_, e) => e.Accept = true`
+when it is set. With the default, every server certificate is accepted unconditionally
+— there is no certificate-pinning or trust-store enforcement — which defeats the
+`Sign`/`SignAndEncrypt` security modes against an active man-in-the-middle on the OPC
+UA link. The design doc explicitly lists `true` as the default. For an industrial
+control link this is a meaningful exposure; a secure-by-default posture would reject
+untrusted certs unless an operator opts in per connection.
+
+**Recommendation**
+
+Default `AutoAcceptUntrustedCerts` to `false` and require explicit per-connection
+opt-in, or at minimum log a prominent warning whenever the auto-accept validator is
+installed. Update the design doc to reflect the secure default.
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-013 — Misleading XML comment: `RaiseDisconnected` claims thread safety it does not provide
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:270-281` |
+
+**Description**
+
+The XML doc on `RaiseDisconnected` states "Thread-safe: only the first caller triggers
+the event." The implementation is a non-atomic check-then-set on a `volatile bool`
+(`if (_disconnectFired) return; _disconnectFired = true;`). `volatile` guarantees
+visibility, not atomicity — two threads (e.g. the OPC UA keep-alive thread via
+`OnClientConnectionLost` and a `ReadAsync` failure path) can both observe
+`_disconnectFired == false` and both invoke `Disconnected`. In practice the
+`DataConnectionActor` tolerates a duplicate `AdapterDisconnected` message, so impact
+is low, but the comment overstates the guarantee. The same pattern exists in
+`RealOpcUaClient.OnSessionKeepAlive` (`_connectionLostFired`).
+
+**Recommendation**
+
+Either make the guard atomic (`Interlocked.Exchange` with an `int` flag, or a lock),
+or correct the comment to say "best-effort once-only; a duplicate event is possible
+under a race and is tolerated downstream."
+
+**Resolution**
+
+_Unresolved._