docs(code-reviews): re-review batch 2 at 39d737e — ConfigurationDatabase, DataConnectionLayer, DeploymentManager, ExternalSystemGateway, HealthMonitoring

17 new findings: ConfigurationDatabase-012..014, DataConnectionLayer-014..017, DeploymentManager-015..017, ExternalSystemGateway-015..017, HealthMonitoring-013..016.
Joseph Doherty
2026-05-17 00:45:10 -04:00
parent e49846603e
commit 89636e2bbf
6 changed files with 895 additions and 64 deletions


@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.DataConnectionLayer` |
| Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 4 |
## Summary
@@ -30,20 +30,40 @@ the design doc's failover state machine and the implemented unstable-disconnect
heuristic. Test coverage is adequate for the happy paths and failover but absent for
tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.
#### Re-review 2026-05-17 (commit `39d737e`)
All 13 findings from the 2026-05-16 review remain `Resolved` and the fixes were
verified in place against the current source (`PipeTo(Self)` subscribe pattern,
`Resume` supervision, `ConcurrentDictionary` callback maps, atomic disconnect guards,
bounded write timeout, etc.). The re-review walked all 10 checklist categories again
and found **4 new findings**. One is **High**: the DCL-012 security warning is never
seen in production because `RealOpcUaClientFactory.Create()` constructs
`RealOpcUaClient` with no logger, so the warning sinks into `NullLogger`. Two are
**Medium**: initial-connect failures in the `Connecting` state never count toward
failover, so a connection whose primary endpoint is unreachable at startup retries the
primary forever and never tries the configured backup; and
`HandleSubscribeCompleted` always replies `SubscribeTagsResponse(success: true)` even
when a connection-level subscribe failure is driving the actor into `Reconnecting`,
telling the Instance Actor the subscribe succeeded when it did not. One is **Low**:
`WriteBatchAsync` does not catch the `InvalidOperationException` from `EnsureConnected`,
so a mid-batch disconnect aborts the whole write batch (the same class of defect
DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
`DataConnectionLayer-014`.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | `_resolvedTags` double-counting and stale counters after failover; `ReadBatchAsync` aborts mid-batch. |
| 2 | Akka.NET conventions | x | `Task.Run` mutating actor state (critical); `Restart` supervision loses state; closures capturing `_subscriptionsByInstance`. |
| 3 | Concurrency & thread safety | x | Actor state mutated off the actor thread; `RealOpcUaClient` callback dictionary unsynchronized. |
| 4 | Error handling & resilience | x | Subscription failures not surfaced; unbounded write with no timeout; reconnect after subscribe-time failure not handled. |
| 5 | Security | x | `AutoAcceptUntrustedCerts` defaults to `true`; OPC UA password handling acceptable. See finding 012. |
| 6 | Performance & resource management | x | `HandleUnsubscribe` O(n^2) over instances; initial-read loop serial per tag. |
| 7 | Design-document adherence | x | Failover heuristic (unstable-disconnect count) differs from documented state machine; `WriteTimeout` documented but unused. |
| 1 | Correctness & logic bugs | x | 2026-05-16 findings resolved. Re-review: finding 016 — `SubscribeTagsResponse` reports success on a connection-level subscribe failure. |
| 2 | Akka.NET conventions | x | 2026-05-16 findings resolved (`PipeTo(Self)` subscribe, `Resume` supervision). Re-review: no new issues. |
| 3 | Concurrency & thread safety | x | 2026-05-16 findings resolved (`ConcurrentDictionary`, atomic disconnect guards). Re-review: no new issues. |
| 4 | Error handling & resilience | x | Re-review: finding 015 — initial-connect failures never trigger failover; finding 017 — `WriteBatchAsync` aborts on mid-batch disconnect. |
| 5 | Security | x | Re-review: finding 014 — the DCL-012 auto-accept-cert warning is never logged in production (`RealOpcUaClient` built without a logger). |
| 6 | Performance & resource management | x | 2026-05-16 finding 008 resolved (reverse index). Re-review: no new issues. |
| 7 | Design-document adherence | x | 2026-05-16 findings 005/009 resolved. Re-review: no new issues (finding 015 logged under resilience). |
| 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. |
| 9 | Testing coverage | x | No tests for tag-resolution retry, disconnect/re-subscribe, bad-quality push, or `HandleSubscribe` concurrency. |
| 10 | Documentation & comments | x | XML comment on `RaiseDisconnected` claims thread safety it does not have; design doc round-robin description stale. |
| 9 | Testing coverage | x | DCL001 to DCL013 regression tests present. Re-review: gaps remain for the finding 014/015/016 scenarios (no test for production logger wiring, startup failover, or subscribe-response-on-failure). |
| 10 | Documentation & comments | x | 2026-05-16 finding 013 resolved. Re-review: no new issues. |
## Findings
@@ -661,3 +681,179 @@ fanning 32 barrier-synchronised threads that raise the client's `ConnectionLost`
simultaneously, and asserts `Disconnected` fires exactly once per round; against a
non-atomic check-then-set it double-fires (verified by temporarily reverting the
guard), and it passes against the atomic fix.
### DataConnectionLayer-014 — DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:325`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:35-39,79-83` |
**Description**
Finding DataConnectionLayer-012 was resolved in part by adding a prominent
`ILogger` warning in `RealOpcUaClient.ConnectAsync` whenever the auto-accept
certificate validator is installed (`RealOpcUaClient.cs:79-83`). The
`ILogger<RealOpcUaClient>` constructor parameter was made optional, defaulting to
`NullLogger<RealOpcUaClient>.Instance` (`RealOpcUaClient.cs:35-39`).
However, the only production code path that constructs a `RealOpcUaClient` is
`RealOpcUaClientFactory.Create()` (`RealOpcUaClient.cs:325`), which calls
`new RealOpcUaClient(_globalOptions)` and passes **no logger**. The factory itself
holds only an `OpcUaGlobalOptions` and has no `ILoggerFactory`/`ILogger` available.
As a result the `_logger` field is always `NullLogger` for every real OPC UA
connection, and the man-in-the-middle warning the DCL-012 fix added is silently
discarded. An operator who deploys a connection with `AutoAcceptUntrustedCerts`
enabled — accepting any server certificate on an industrial control link — gets no
visible signal anywhere in the logs. The in-scope half of DCL-012's resolution is
therefore not actually effective in production; only the unit test
(`DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse`, which only
checks the default value) passes.
**Recommendation**
Thread a real logger through to `RealOpcUaClient`. `DataConnectionFactory` already
holds an `ILoggerFactory` and constructs `RealOpcUaClientFactory(globalOptions)`, so
give `RealOpcUaClientFactory` an `ILoggerFactory` (or an `ILogger<RealOpcUaClient>`)
constructor parameter and pass `_loggerFactory.CreateLogger<RealOpcUaClient>()` into
each `new RealOpcUaClient(...)`. Add a test that asserts the warning is emitted on a
real connect with auto-accept enabled (e.g. via a captured `ILogger`), not just that
the default is `false`.
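
The recommended wiring could look roughly like the sketch below. It assumes the `Microsoft.Extensions.Logging.Abstractions` package is referenced. `SketchOpcUaClient`, `SketchOpcUaClientFactory`, and `CapturingLoggerFactory` are hypothetical stand-ins rather than the real `RealOpcUaClient` types, and the warning text is illustrative; only the logger hand-off is modelled.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Logging.Abstractions;

// Hypothetical stand-in for RealOpcUaClient: only the logger wiring is modelled.
public sealed class SketchOpcUaClient
{
    private readonly ILogger<SketchOpcUaClient> _logger;

    // Optional logger, defaulting to NullLogger, as the finding describes.
    public SketchOpcUaClient(ILogger<SketchOpcUaClient>? logger = null)
        => _logger = logger ?? NullLogger<SketchOpcUaClient>.Instance;

    public void Connect(bool autoAcceptUntrustedCerts)
    {
        if (autoAcceptUntrustedCerts)
            _logger.LogWarning("AutoAcceptUntrustedCerts is enabled; any server certificate will be accepted.");
    }
}

// Hypothetical stand-in for RealOpcUaClientFactory, now holding an ILoggerFactory.
public sealed class SketchOpcUaClientFactory
{
    private readonly ILoggerFactory _loggerFactory;

    public SketchOpcUaClientFactory(ILoggerFactory loggerFactory) => _loggerFactory = loggerFactory;

    // Every client gets a real logger instead of falling back to NullLogger.
    public SketchOpcUaClient Create()
        => new SketchOpcUaClient(_loggerFactory.CreateLogger<SketchOpcUaClient>());
}

// Test double that records every log entry so a test can assert on the warning.
public sealed class CapturingLoggerFactory : ILoggerFactory
{
    public List<(LogLevel Level, string Message)> Entries { get; } = new();

    public ILogger CreateLogger(string categoryName) => new CapturingLogger(Entries);
    public void AddProvider(ILoggerProvider provider) { }
    public void Dispose() { }

    private sealed class CapturingLogger : ILogger
    {
        private readonly List<(LogLevel, string)> _entries;

        public CapturingLogger(List<(LogLevel, string)> entries) => _entries = entries;

        public IDisposable? BeginScope<TState>(TState state) where TState : notnull => null;
        public bool IsEnabled(LogLevel logLevel) => true;

        public void Log<TState>(LogLevel logLevel, EventId eventId, TState state,
            Exception? exception, Func<TState, Exception?, string> formatter)
            => _entries.Add((logLevel, formatter(state, exception)));
    }
}
```

A regression test along these lines would create the client through the factory with auto-accept enabled and assert that `Entries` contains a `Warning`, which covers the production path rather than only the option's default value.
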
**Resolution**
_Unresolved._
### DataConnectionLayer-015 — Initial-connect failures never trigger failover; an unreachable primary at startup never tries the backup
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:404-417`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:419-493` |
**Description**
Failover between the primary and backup endpoints is implemented in two places, both
reachable only after a connection has already been `Connected` at least once:
`HandleReconnectResult` (Reconnecting state) counts `_consecutiveFailures` and switches
endpoint, and `BecomeReconnecting` counts `_consecutiveUnstableDisconnects`.
`HandleConnectResult` — the handler for the *initial* connection attempt in the
`Connecting` state (`DataConnectionActor.cs:404-417`) — does neither. On failure it
only logs and re-arms the reconnect timer with `AttemptConnect`; it never increments
`_consecutiveFailures`, never consults `_backupConfig`, and never switches endpoint.
Consequence: if the primary endpoint is unreachable when the connection actor first
starts — which is the common case after a fresh artifact deployment, a site restart,
or a primary that is simply down at that moment — the actor retries the *primary*
endpoint indefinitely at `ReconnectInterval` and **never** attempts the configured
backup. The design doc's endpoint-redundancy promise ("automatic failover when the
active endpoint becomes unreachable") is silently not honoured for the
never-connected-yet case, and an operator sees a connection stuck `Connecting` forever
despite a healthy backup being configured.
**Recommendation**
Make `HandleConnectResult` participate in the failover counter the same way
`HandleReconnectResult` does: increment `_consecutiveFailures` on failure and, when
`_backupConfig != null && _consecutiveFailures >= _failoverRetryCount`, perform the
endpoint switch (dispose adapter, create the other adapter, bump `_adapterGeneration`,
log the failover event) before re-arming the timer. Alternatively, fold the initial
connect into the same reconnect path so there is a single failover decision point. Add
a regression test for "primary down at startup, backup configured → fails over to
backup".
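
One way to get a single failover decision point is to move the counting into a small helper that both `HandleConnectResult` and `HandleReconnectResult` call. The sketch below is a simplified model, not the actor's real code: `FailoverTracker` is a hypothetical name, and resetting the counter after a switch and toggling back to the primary on a later switch are assumptions made for illustration.

```csharp
using System;

// Hypothetical model of the failover decision shared by initial connect and reconnect.
public sealed class FailoverTracker
{
    private readonly bool _hasBackup;
    private readonly int _failoverRetryCount;
    private int _consecutiveFailures;

    public FailoverTracker(bool hasBackup, int failoverRetryCount)
    {
        if (failoverRetryCount < 1)
            throw new ArgumentOutOfRangeException(nameof(failoverRetryCount));
        _hasBackup = hasBackup;
        _failoverRetryCount = failoverRetryCount;
    }

    // True while the backup endpoint is the active one.
    public bool UsingBackup { get; private set; }

    // Bumped on every endpoint switch, mirroring the adapter-generation idea.
    public int AdapterGeneration { get; private set; }

    // Call on every failed attempt, initial or reconnect.
    // Returns true when the caller should switch endpoint before re-arming the timer.
    public bool RecordFailure()
    {
        _consecutiveFailures++;
        if (!_hasBackup || _consecutiveFailures < _failoverRetryCount)
            return false;

        UsingBackup = !UsingBackup;
        AdapterGeneration++;
        _consecutiveFailures = 0; // assumption: start counting afresh on the new endpoint
        return true;
    }

    // Call on a successful connect so earlier failures no longer count.
    public void RecordSuccess() => _consecutiveFailures = 0;
}
```

With a retry count of 3 and a backup configured, the third consecutive failure returns `true` regardless of whether the actor has ever been `Connected`, which is the startup behaviour the finding says is missing.
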
**Resolution**
_Unresolved._
### DataConnectionLayer-016 — `HandleSubscribeCompleted` reports `SubscribeTagsResponse` success even on a connection-level subscribe failure
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:606,666-672`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:232-240` |
**Description**
`HandleSubscribeCompleted` computes `connectionLevelFailure` (line 606) and returns it
so the `Connected`-state handler can drive the actor into `Reconnecting`
(`DataConnectionActor.cs:232-240`). But before returning, it unconditionally replies
to the caller with `new SubscribeTagsResponse(..., true, null, ...)` (lines 666-667) —
`Success: true`, `Error: null` — regardless of whether any tag failed at connection
level.
So when a subscribe arrives while the adapter is silently down, the Instance Actor is
told the subscribe **succeeded**, while the connection actor simultaneously transitions
to `Reconnecting`. The tags were never actually subscribed at the adapter (the catch
block recorded `Success: false`); they are recovered later by `ReSubscribeAll` only if
and when reconnection succeeds. The caller has no way to distinguish "subscribed and
healthy" from "accepted, but the connection is currently down" — a misleading
success signal on a request that did not do what the response claims.
(Genuine tag-resolution failures are arguably also reported as overall `true`, but
that is defensible: those tags are tracked in `_unresolvedTags` and the design models
unresolved tags as a runtime quality concern, with a `Bad`-quality `TagValueUpdate`
already pushed. The connection-level case is the clear defect because the actor itself
treats it as a failure worth a state transition.)
**Recommendation**
When `connectionLevelFailure` is true, reply with
`SubscribeTagsResponse(..., success: false, error: "connection unavailable — will
re-subscribe on reconnect", ...)` (or an equivalent), so the caller's response matches
the actor's own assessment. Optionally carry per-tag outcomes in the response so the
Instance Actor can reflect partial success. Add a test asserting the response is not
`Success: true` when a connection-level subscribe failure drives `Reconnecting`.
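
The reply logic could be sketched as follows. This is a simplified model under stated assumptions: `TagOutcome`, `SubscribeReply`, and `SubscribeReplySketch` are hypothetical stand-ins for the real message types, and the per-tag map is the optional extension mentioned above, not something the current `SubscribeTagsResponse` is known to carry.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-tag outcome; the real actor tracks this differently.
public enum TagOutcome { Subscribed, Unresolved, ConnectionFailure }

// Hypothetical stand-in for SubscribeTagsResponse with per-tag outcomes attached.
public sealed record SubscribeReply(
    bool Success,
    string? Error,
    IReadOnlyDictionary<string, TagOutcome> PerTag);

public static class SubscribeReplySketch
{
    public static SubscribeReply Build(IReadOnlyDictionary<string, TagOutcome> outcomes)
    {
        // Only a connection-level failure flips the overall result; unresolved tags
        // stay a runtime quality concern, as the description above argues.
        bool connectionLevelFailure =
            outcomes.Values.Any(o => o == TagOutcome.ConnectionFailure);

        return connectionLevelFailure
            ? new SubscribeReply(false, "connection unavailable; will re-subscribe on reconnect", outcomes)
            : new SubscribeReply(true, null, outcomes);
    }
}
```

The point of the sketch is that the reply is derived from the same flag that drives the transition to `Reconnecting`, so the two can no longer disagree.
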
**Resolution**
_Unresolved._
### DataConnectionLayer-017 — `WriteBatchAsync` aborts the whole batch on a mid-batch disconnect
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:229-237`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:218-227` |
**Description**
`WriteBatchAsync` loops calling `WriteAsync` per tag (`OpcUaDataConnection.cs:229-237`).
`WriteAsync` returns a `WriteResult` for OPC-UA-level write rejections (good — a bad
status does not abort the batch), but it first calls `EnsureConnected()`
(`OpcUaDataConnection.cs:220`), which throws `InvalidOperationException` when the
client is disconnected. `WriteBatchAsync` does not catch that exception, so if the
connection drops partway through a batch the whole `WriteBatchAsync` throws and the
caller gets no result map — losing the per-tag outcomes for the tags that already
wrote. This is the same class of defect that DataConnectionLayer-007 fixed for
`ReadBatchAsync` (which now records a failed `ReadResult` per failing tag and only
propagates `OperationCanceledException`). `WriteBatchAsync` feeds
`WriteBatchAndWaitAsync` (line 246), so a disconnect during a flag-and-wait write
sequence surfaces as an unhandled exception rather than a clean `false`/per-tag result.
Severity is Low because device writes are real-time control operations with no
store-and-forward, the batch write paths are not on the primary `HandleWrite` hot path
(`HandleWrite` calls single-tag `WriteAsync`), and a disconnect mid-batch is itself an
error condition — but the inconsistent error shape (exception vs. per-tag result) is a
maintainability and correctness wart.
**Recommendation**
Mirror the DCL-007 fix: wrap the per-tag `WriteAsync` call in `WriteBatchAsync` in a
try/catch that records a failed `WriteResult(false, ex.Message)` for the failing tag
and continues, while still propagating `OperationCanceledException` to abort a
cancelled batch as a whole. This gives callers (including `WriteBatchAndWaitAsync`) a
complete, consistent result map.
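
A minimal sketch of that per-tag guard is shown below. The `WriteResult` record and the `writeAsync` delegate are stand-ins introduced for illustration (the real `OpcUaDataConnection` signatures may differ); only the try/catch shape is the point.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Stand-in result type; the real WriteResult may carry more fields.
public sealed record WriteResult(bool Success, string? ErrorMessage);

public static class WriteBatchSketch
{
    // writeAsync stands in for the single-tag WriteAsync, which may throw
    // InvalidOperationException when the client is disconnected.
    public static async Task<IReadOnlyDictionary<string, WriteResult>> WriteBatchAsync(
        IReadOnlyDictionary<string, object?> values,
        Func<string, object?, CancellationToken, Task<WriteResult>> writeAsync,
        CancellationToken cancellationToken = default)
    {
        var results = new Dictionary<string, WriteResult>();

        foreach (var (tagPath, value) in values)
        {
            try
            {
                results[tagPath] = await writeAsync(tagPath, value, cancellationToken);
            }
            catch (OperationCanceledException)
            {
                throw; // a cancelled batch still aborts as a whole
            }
            catch (Exception ex)
            {
                // Record the failure for this tag and keep going with the rest.
                results[tagPath] = new WriteResult(false, ex.Message);
            }
        }

        return results;
    }
}
```

With this shape a disconnect partway through yields a full result map in which the affected tags carry `Success: false`, so the outcomes of the tags that already wrote are preserved.
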
**Resolution**
_Unresolved._