docs(code-reviews): re-review batch 2 at 39d737e — ConfigurationDatabase, DataConnectionLayer, DeploymentManager, ExternalSystemGateway, HealthMonitoring

17 new findings: ConfigurationDatabase-012..014, DataConnectionLayer-014..017, DeploymentManager-015..017, ExternalSystemGateway-015..017, HealthMonitoring-013..016.
2026-05-17 00:45:10 -04:00
parent e49846603e
commit 89636e2bbf
6 changed files with 895 additions and 64 deletions
--- a/code-reviews/DataConnectionLayer/findings.md
+++ b/code-reviews/DataConnectionLayer/findings.md
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.DataConnectionLayer` |
 | Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 4 |

 ## Summary

@@ -30,20 +30,40 @@ the design doc's failover state machine and the implemented unstable-disconnect
 heuristic. Test coverage is adequate for the happy paths and failover but absent for
 tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.

+#### Re-review 2026-05-17 (commit `39d737e`)
+
+All 13 findings from the 2026-05-16 review remain `Resolved` and the fixes were
+verified in place against the current source (`PipeTo(Self)` subscribe pattern,
+`Resume` supervision, `ConcurrentDictionary` callback maps, atomic disconnect guards,
+bounded write timeout, etc.). The re-review walked all 10 checklist categories again
+and found **4 new findings**: one **High** — the DCL-012 security warning is never
+seen in production because `RealOpcUaClientFactory.Create()` constructs
+`RealOpcUaClient` with no logger, so the warning sinks into `NullLogger`; one
+**Medium** — initial-connect failures in the `Connecting` state never count toward
+failover, so a connection whose primary endpoint is unreachable at startup retries the
+primary forever and never tries the configured backup; one **Medium** —
+`HandleSubscribeCompleted` always replies `SubscribeTagsResponse(success: true)` even
+when a connection-level subscribe failure is driving the actor into `Reconnecting`,
+telling the Instance Actor the subscribe succeeded when it did not; and one **Low** —
+`WriteBatchAsync` does not catch the `InvalidOperationException` from `EnsureConnected`,
+so a mid-batch disconnect aborts the whole write batch (the same class of defect
+DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
+`DataConnectionLayer-014`.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | x | `_resolvedTags` double-counting and stale counters after failover; `ReadBatchAsync` aborts mid-batch. |
-| 2 | Akka.NET conventions | x | `Task.Run` mutating actor state (critical); `Restart` supervision loses state; closures capturing `_subscriptionsByInstance`. |
-| 3 | Concurrency & thread safety | x | Actor state mutated off the actor thread; `RealOpcUaClient` callback dictionary unsynchronized. |
-| 4 | Error handling & resilience | x | Subscription failures not surfaced; unbounded write with no timeout; reconnect after subscribe-time failure not handled. |
-| 5 | Security | x | `AutoAcceptUntrustedCerts` defaults to `true`; OPC UA password handling acceptable. See finding 012. |
-| 6 | Performance & resource management | x | `HandleUnsubscribe` O(n^2) over instances; initial-read loop serial per tag. |
-| 7 | Design-document adherence | x | Failover heuristic (unstable-disconnect count) differs from documented state machine; `WriteTimeout` documented but unused. |
+| 1 | Correctness & logic bugs | x | 2026-05-16 findings resolved. Re-review: finding 016 — `SubscribeTagsResponse` reports success on a connection-level subscribe failure. |
+| 2 | Akka.NET conventions | x | 2026-05-16 findings resolved (`PipeTo(Self)` subscribe, `Resume` supervision). Re-review: no new issues. |
+| 3 | Concurrency & thread safety | x | 2026-05-16 findings resolved (`ConcurrentDictionary`, atomic disconnect guards). Re-review: no new issues. |
+| 4 | Error handling & resilience | x | Re-review: finding 015 — initial-connect failures never trigger failover; finding 017 — `WriteBatchAsync` aborts on mid-batch disconnect. |
+| 5 | Security | x | Re-review: finding 014 — the DCL-012 auto-accept-cert warning is never logged in production (`RealOpcUaClient` built without a logger). |
+| 6 | Performance & resource management | x | 2026-05-16 finding 008 resolved (reverse index). Re-review: no new issues. |
+| 7 | Design-document adherence | x | 2026-05-16 findings 005/009 resolved. Re-review: no new issues (finding 015 logged under resilience). |
 | 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. |
-| 9 | Testing coverage | x | No tests for tag-resolution retry, disconnect/re-subscribe, bad-quality push, or `HandleSubscribe` concurrency. |
-| 10 | Documentation & comments | x | XML comment on `RaiseDisconnected` claims thread safety it does not have; design doc round-robin description stale. |
+| 9 | Testing coverage | x | DCL001–013 regression tests present. Re-review: gaps remain for the scenarios in findings 014, 015, and 016 (no test for production logger wiring, startup failover, or subscribe-response-on-failure). |
+| 10 | Documentation & comments | x | 2026-05-16 finding 013 resolved. Re-review: no new issues. |

 ## Findings

@@ -661,3 +681,179 @@ fanning 32 barrier-synchronised threads that raise the client's `ConnectionLost`
 simultaneously, and asserts `Disconnected` fires exactly once per round; against a
 non-atomic check-then-set it double-fires (verified by temporarily reverting the
 guard), and it passes against the atomic fix.
+
+### DataConnectionLayer-014 — DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:325`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:35-39,79-83` |
+
+**Description**
+
+Finding DataConnectionLayer-012 was resolved in part by adding a prominent
+`ILogger` warning in `RealOpcUaClient.ConnectAsync` whenever the auto-accept
+certificate validator is installed (`RealOpcUaClient.cs:79-83`). The
+`ILogger<RealOpcUaClient>` constructor parameter was made optional, defaulting to
+`NullLogger<RealOpcUaClient>.Instance` (`RealOpcUaClient.cs:35-39`).
+
+However, the only production code path that constructs a `RealOpcUaClient` is
+`RealOpcUaClientFactory.Create()` (`RealOpcUaClient.cs:325`), which calls
+`new RealOpcUaClient(_globalOptions)` and passes **no logger**. The factory itself
+holds only an `OpcUaGlobalOptions` and has no `ILoggerFactory`/`ILogger` available.
+As a result the `_logger` field is always `NullLogger` for every real OPC UA
+connection, and the man-in-the-middle warning the DCL-012 fix added is silently
+discarded. An operator who deploys a connection with `AutoAcceptUntrustedCerts`
+enabled — accepting any server certificate on an industrial control link — gets no
+visible signal anywhere in the logs. The in-scope half of DCL-012's resolution is
+therefore not effective in production. The existing unit test
+(`DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse`) still
+passes because it checks only the default value, not that the warning is emitted.
+
+**Recommendation**
+
+Thread a real logger through to `RealOpcUaClient`. `DataConnectionFactory` already
+holds an `ILoggerFactory` and constructs `RealOpcUaClientFactory(globalOptions)` —
+give `RealOpcUaClientFactory` an `ILoggerFactory` (or an `ILogger<RealOpcUaClient>`)
+constructor parameter and pass `_loggerFactory.CreateLogger<RealOpcUaClient>()` into
+each `new RealOpcUaClient(...)`. Add a test that asserts the warning is emitted on a
+real connect with auto-accept enabled (e.g. via a captured `ILogger`), not just that
+the default is `false`.
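+
+A minimal sketch of that wiring, under stated assumptions: `OpcUaGlobalOptions` and
+`RealOpcUaClient` below are simplified stand-ins for the real types, and
+`WarnIfAutoAccept` is a hypothetical helper standing in for the warning emitted in
+`ConnectAsync`. Only the constructor shapes described above are taken from the source.
+
+```csharp
+#nullable enable
+using Microsoft.Extensions.Logging;
+using Microsoft.Extensions.Logging.Abstractions;
+
+// Stand-in for the real options type; only its existence matters here.
+public sealed class OpcUaGlobalOptions { }
+
+// Stand-in for the real client. The optional logger defaults to NullLogger,
+// which is why a factory that passes nothing silently drops the warning.
+public sealed class RealOpcUaClient
+{
+    private readonly ILogger<RealOpcUaClient> _logger;
+
+    public RealOpcUaClient(OpcUaGlobalOptions options, ILogger<RealOpcUaClient>? logger = null)
+        => _logger = logger ?? NullLogger<RealOpcUaClient>.Instance;
+
+    // Hypothetical helper standing in for the warning emitted in ConnectAsync.
+    public void WarnIfAutoAccept(bool autoAcceptUntrustedCerts)
+    {
+        if (autoAcceptUntrustedCerts)
+            _logger.LogWarning("AutoAcceptUntrustedCerts is enabled; server certificates are not validated.");
+    }
+}
+
+public sealed class RealOpcUaClientFactory
+{
+    private readonly OpcUaGlobalOptions _globalOptions;
+    private readonly ILoggerFactory _loggerFactory;
+
+    public RealOpcUaClientFactory(OpcUaGlobalOptions globalOptions, ILoggerFactory loggerFactory)
+    {
+        _globalOptions = globalOptions;
+        _loggerFactory = loggerFactory;
+    }
+
+    // Every client now receives a real logger instead of the NullLogger default.
+    public RealOpcUaClient Create()
+        => new RealOpcUaClient(_globalOptions, _loggerFactory.CreateLogger<RealOpcUaClient>());
+}
+```
+
+Passing an `ILoggerFactory` rather than a single `ILogger<RealOpcUaClient>` keeps the
+factory's constructor stable if other adapter types later need their own loggers.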
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-015 — Initial-connect failures never trigger failover; an unreachable primary at startup never tries the backup
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:404-417`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:419-493` |
+
+**Description**
+
+Failover between the primary and backup endpoints is implemented in two places, both
+reachable only after a connection has already been `Connected` at least once:
+`HandleReconnectResult` (Reconnecting state) counts `_consecutiveFailures` and switches
+endpoint, and `BecomeReconnecting` counts `_consecutiveUnstableDisconnects`.
+
+`HandleConnectResult` — the handler for the *initial* connection attempt in the
+`Connecting` state (`DataConnectionActor.cs:404-417`) — does neither. On failure it
+only logs and re-arms the reconnect timer with `AttemptConnect`; it never increments
+`_consecutiveFailures`, never consults `_backupConfig`, and never switches endpoint.
+
+Consequence: if the primary endpoint is unreachable when the connection actor first
+starts — which is the common case after a fresh artifact deployment, a site restart,
+or a primary that is simply down at that moment — the actor retries the *primary*
+endpoint indefinitely at `ReconnectInterval` and **never** attempts the configured
+backup. The design doc's endpoint-redundancy promise ("automatic failover when the
+active endpoint becomes unreachable") is silently not honoured for the
+never-connected-yet case, and an operator sees a connection stuck `Connecting` forever
+despite a healthy backup being configured.
+
+**Recommendation**
+
+Make `HandleConnectResult` participate in the failover counter the same way
+`HandleReconnectResult` does: increment `_consecutiveFailures` on failure and, when
+`_backupConfig != null && _consecutiveFailures >= _failoverRetryCount`, perform the
+endpoint switch (dispose adapter, create the other adapter, bump `_adapterGeneration`,
+log the failover event) before re-arming the timer. Alternatively, fold the initial
+connect into the same reconnect path so there is a single failover decision point. Add
+a regression test for "primary down at startup, backup configured → fails over to
+backup".
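+
+One way to get a single failover decision point is a small counter object that both
+`HandleConnectResult` and `HandleReconnectResult` call. The sketch below is
+illustrative only: `FailoverTracker` is a hypothetical name, and resetting the counter
+after a switch is an assumption, not confirmed behaviour of the current actor.
+
+```csharp
+// Hypothetical helper: one failover decision shared by the initial-connect
+// and reconnect paths. Field names mirror the actor's but are illustrative.
+public sealed class FailoverTracker
+{
+    private readonly bool _hasBackup;
+    private readonly int _failoverRetryCount;
+    private int _consecutiveFailures;
+
+    public FailoverTracker(bool hasBackup, int failoverRetryCount)
+    {
+        _hasBackup = hasBackup;
+        _failoverRetryCount = failoverRetryCount;
+    }
+
+    // True while the backup endpoint is the active one.
+    public bool UsingBackup { get; private set; }
+
+    // Call on any successful connect to clear the failure count.
+    public void RecordSuccess() => _consecutiveFailures = 0;
+
+    // Call on every failed attempt, initial or reconnect. Returns true when
+    // the caller should switch endpoint before re-arming the timer.
+    public bool RecordFailure()
+    {
+        _consecutiveFailures++;
+        if (!_hasBackup || _consecutiveFailures < _failoverRetryCount)
+            return false;
+
+        _consecutiveFailures = 0;   // assumption: start a fresh count per endpoint
+        UsingBackup = !UsingBackup;
+        return true;
+    }
+}
+```
+
+With `failoverRetryCount` set to 3 and a backup configured, the third consecutive
+failed startup attempt returns `true`, at which point the actor would dispose the
+adapter, create the other one, and bump `_adapterGeneration` as described above.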
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-016 — `HandleSubscribeCompleted` reports `SubscribeTagsResponse` success even on a connection-level subscribe failure
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:606,666-672`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:232-240` |
+
+**Description**
+
+`HandleSubscribeCompleted` computes `connectionLevelFailure` (line 606) and returns it
+so the `Connected`-state handler can drive the actor into `Reconnecting`
+(`DataConnectionActor.cs:232-240`). But before returning, it unconditionally replies
+to the caller with `new SubscribeTagsResponse(..., true, null, ...)` (lines 666-667) —
+`Success: true`, `Error: null` — regardless of whether any tag failed at connection
+level.
+
+So when a subscribe arrives while the adapter is silently down, the Instance Actor is
+told the subscribe **succeeded**, while the connection actor simultaneously transitions
+to `Reconnecting`. The tags were never actually subscribed at the adapter (the catch
+block recorded `Success: false`); they are recovered later by `ReSubscribeAll` only if
+and when reconnection succeeds. The caller has no way to distinguish "subscribed and
+healthy" from "accepted, but the connection is currently down" — a misleading
+success signal on a request that did not do what the response claims.
+
+(Genuine tag-resolution failures are also reported as overall `Success: true`. That is
+arguably misleading too, but it is defensible: those tags are tracked in
+`_unresolvedTags` and the design models
+unresolved tags as a runtime quality concern, with a `Bad`-quality `TagValueUpdate`
+already pushed. The connection-level case is the clear defect because the actor itself
+treats it as a failure worth a state transition.)
+
+**Recommendation**
+
+When `connectionLevelFailure` is true, reply with
+`SubscribeTagsResponse(..., success: false, error: "connection unavailable — will
+re-subscribe on reconnect", ...)` (or an equivalent), so the caller's response matches
+the actor's own assessment. Optionally carry per-tag outcomes in the response so the
+Instance Actor can reflect partial success. Add a test asserting the response is not
+`Success: true` when a connection-level subscribe failure drives `Reconnecting`.
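+
+A sketch of the reply selection, assuming a reduced response shape: the real
+`SubscribeTagsResponse` has additional parameters not shown in this finding, so the
+record below and its `CorrelationId` field are hypothetical placeholders.
+
+```csharp
+#nullable enable
+
+// Hypothetical, reduced response shape; the real record's other fields are omitted.
+public sealed record SubscribeTagsResponse(string CorrelationId, bool Success, string? Error);
+
+public static class SubscribeReply
+{
+    // Builds the reply so it agrees with the actor's own assessment:
+    // a connection-level failure is never reported as success.
+    public static SubscribeTagsResponse Build(string correlationId, bool connectionLevelFailure)
+        => connectionLevelFailure
+            ? new SubscribeTagsResponse(
+                correlationId,
+                false,
+                "connection unavailable; will re-subscribe on reconnect")
+            : new SubscribeTagsResponse(correlationId, true, null);
+}
+```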
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-017 — `WriteBatchAsync` aborts the whole batch on a mid-batch disconnect
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:229-237`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:218-227` |
+
+**Description**
+
+`WriteBatchAsync` loops calling `WriteAsync` per tag (`OpcUaDataConnection.cs:229-237`).
+`WriteAsync` returns a `WriteResult` for OPC-UA-level write rejections (good — a bad
+status does not abort the batch), but it first calls `EnsureConnected()`
+(`OpcUaDataConnection.cs:220`), which throws `InvalidOperationException` when the
+client is disconnected. `WriteBatchAsync` does not catch that exception, so if the
+connection drops partway through a batch the whole `WriteBatchAsync` throws and the
+caller gets no result map — losing the per-tag outcomes for the tags that already
+wrote. This is the same class of defect that DataConnectionLayer-007 fixed for
+`ReadBatchAsync` (which now records a failed `ReadResult` per failing tag and only
+propagates `OperationCanceledException`). `WriteBatchAsync` feeds
+`WriteBatchAndWaitAsync` (line 246), so a disconnect during a flag-and-wait write
+sequence surfaces as an unhandled exception rather than a clean `false`/per-tag result.
+
+Severity is Low because device writes are real-time control operations with no
+store-and-forward, the batch write paths are not on the primary `HandleWrite` hot path
+(`HandleWrite` calls single-tag `WriteAsync`), and a disconnect mid-batch is itself an
+error condition — but the inconsistent error shape (exception vs. per-tag result) is a
+maintainability and correctness wart.
+
+**Recommendation**
+
+Mirror the DCL-007 fix: wrap the per-tag `WriteAsync` call in `WriteBatchAsync` in a
+try/catch that records a failed `WriteResult(false, ex.Message)` for the failing tag
+and continues, while still propagating `OperationCanceledException` to abort a
+cancelled batch as a whole. This gives callers (including `WriteBatchAndWaitAsync`) a
+complete, consistent result map.
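+
+The shape of that fix can be sketched as follows. This is not the adapter's actual
+code: `BatchWriter` and the `writeAsync` delegate are hypothetical stand-ins for
+`OpcUaDataConnection` and its single-tag `WriteAsync`, and `WriteResult` is reduced to
+the two fields used in this finding.
+
+```csharp
+#nullable enable
+using System;
+using System.Collections.Generic;
+using System.Threading;
+using System.Threading.Tasks;
+
+// Reduced stand-in for the real result type.
+public sealed record WriteResult(bool Success, string? Error);
+
+public static class BatchWriter
+{
+    public static async Task<IReadOnlyDictionary<string, WriteResult>> WriteBatchAsync(
+        IReadOnlyDictionary<string, object?> values,
+        Func<string, object?, CancellationToken, Task<WriteResult>> writeAsync,
+        CancellationToken cancellationToken = default)
+    {
+        var results = new Dictionary<string, WriteResult>();
+        foreach (var (tag, value) in values)
+        {
+            try
+            {
+                results[tag] = await writeAsync(tag, value, cancellationToken);
+            }
+            catch (OperationCanceledException)
+            {
+                throw; // a cancelled batch still aborts as a whole
+            }
+            catch (Exception ex)
+            {
+                // e.g. InvalidOperationException from EnsureConnected():
+                // record the failure for this tag and keep going.
+                results[tag] = new WriteResult(false, ex.Message);
+            }
+        }
+        return results;
+    }
+}
+```
+
+After a mid-batch disconnect, tags written before the drop keep their successful
+results and every later tag gets a failed `WriteResult`, so the caller always
+receives one entry per requested tag.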
+
+**Resolution**
+
+_Unresolved._