docs(code-reviews): re-review batch 2 at 39d737e — ConfigurationDatabase, DataConnectionLayer, DeploymentManager, ExternalSystemGateway, HealthMonitoring
17 new findings: ConfigurationDatabase-012..014, DataConnectionLayer-014..017, DeploymentManager-015..017, ExternalSystemGateway-015..017, HealthMonitoring-013..016.
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.DataConnectionLayer` |
 | Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
 | Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
-| Open findings | 0 |
+| Commit reviewed | `39d737e` |
+| Open findings | 4 |
 
 ## Summary
 
@@ -30,20 +30,40 @@ the design doc's failover state machine and the implemented unstable-disconnect
 heuristic. Test coverage is adequate for the happy paths and failover but absent for
 tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.
 
+#### Re-review 2026-05-17 (commit `39d737e`)
+
+All 13 findings from the 2026-05-16 review remain `Resolved` and the fixes were
+verified in place against the current source (`PipeTo(Self)` subscribe pattern,
+`Resume` supervision, `ConcurrentDictionary` callback maps, atomic disconnect guards,
+bounded write timeout, etc.). The re-review walked all 10 checklist categories again
+and found **4 new findings**: one **High** — the DCL-012 security warning is never
+seen in production because `RealOpcUaClientFactory.Create()` constructs
+`RealOpcUaClient` with no logger, so the warning sinks into `NullLogger`; one
+**Medium** — initial-connect failures in the `Connecting` state never count toward
+failover, so a connection whose primary endpoint is unreachable at startup retries the
+primary forever and never tries the configured backup; one **Medium** —
+`HandleSubscribeCompleted` always replies `SubscribeTagsResponse(success: true)` even
+when a connection-level subscribe failure is driving the actor into `Reconnecting`,
+telling the Instance Actor the subscribe succeeded when it did not; and one **Low** —
+`WriteBatchAsync` does not catch the `InvalidOperationException` from `EnsureConnected`,
+so a mid-batch disconnect aborts the whole write batch (the same class of defect
+DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
+`DataConnectionLayer-014`.
+
 ## Checklist coverage
 
 | # | Category | Examined | Notes |
 |---|----------|----------|-------|
-| 1 | Correctness & logic bugs | x | `_resolvedTags` double-counting and stale counters after failover; `ReadBatchAsync` aborts mid-batch. |
-| 2 | Akka.NET conventions | x | `Task.Run` mutating actor state (critical); `Restart` supervision loses state; closures capturing `_subscriptionsByInstance`. |
-| 3 | Concurrency & thread safety | x | Actor state mutated off the actor thread; `RealOpcUaClient` callback dictionary unsynchronized. |
-| 4 | Error handling & resilience | x | Subscription failures not surfaced; unbounded write with no timeout; reconnect after subscribe-time failure not handled. |
-| 5 | Security | x | `AutoAcceptUntrustedCerts` defaults to `true`; OPC UA password handling acceptable. See finding 012. |
-| 6 | Performance & resource management | x | `HandleUnsubscribe` O(n^2) over instances; initial-read loop serial per tag. |
-| 7 | Design-document adherence | x | Failover heuristic (unstable-disconnect count) differs from documented state machine; `WriteTimeout` documented but unused. |
+| 1 | Correctness & logic bugs | x | 2026-05-16 findings resolved. Re-review: finding 016 — `SubscribeTagsResponse` reports success on a connection-level subscribe failure. |
+| 2 | Akka.NET conventions | x | 2026-05-16 findings resolved (`PipeTo(Self)` subscribe, `Resume` supervision). Re-review: no new issues. |
+| 3 | Concurrency & thread safety | x | 2026-05-16 findings resolved (`ConcurrentDictionary`, atomic disconnect guards). Re-review: no new issues. |
+| 4 | Error handling & resilience | x | Re-review: finding 015 — initial-connect failures never trigger failover; finding 017 — `WriteBatchAsync` aborts on mid-batch disconnect. |
+| 5 | Security | x | Re-review: finding 014 — the DCL-012 auto-accept-cert warning is never logged in production (`RealOpcUaClient` built without a logger). |
+| 6 | Performance & resource management | x | 2026-05-16 finding 008 resolved (reverse index). Re-review: no new issues. |
+| 7 | Design-document adherence | x | 2026-05-16 findings 005/009 resolved. Re-review: no new issues (finding 015 logged under resilience). |
 | 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. |
-| 9 | Testing coverage | x | No tests for tag-resolution retry, disconnect/re-subscribe, bad-quality push, or `HandleSubscribe` concurrency. |
-| 10 | Documentation & comments | x | XML comment on `RaiseDisconnected` claims thread safety it does not have; design doc round-robin description stale. |
+| 9 | Testing coverage | x | DCL001–013 regression tests present. Re-review: gaps remain for finding 014/015/016 scenarios (no test for production logger wiring, startup failover, or subscribe-response-on-failure). |
+| 10 | Documentation & comments | x | 2026-05-16 finding 013 resolved. Re-review: no new issues. |
 
 ## Findings
 
@@ -661,3 +681,179 @@ fanning 32 barrier-synchronised threads that raise the client's `ConnectionLost`
 simultaneously, and asserts `Disconnected` fires exactly once per round; against a
 non-atomic check-then-set it double-fires (verified by temporarily reverting the
 guard), and it passes against the atomic fix.
+
+### DataConnectionLayer-014 — DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:325`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:35-39,79-83` |
+
+**Description**
+
+Finding DataConnectionLayer-012 was resolved in part by adding a prominent
+`ILogger` warning in `RealOpcUaClient.ConnectAsync` whenever the auto-accept
+certificate validator is installed (`RealOpcUaClient.cs:79-83`). The
+`ILogger<RealOpcUaClient>` constructor parameter was made optional, defaulting to
+`NullLogger<RealOpcUaClient>.Instance` (`RealOpcUaClient.cs:35-39`).
+
+However, the only production code path that constructs a `RealOpcUaClient` is
+`RealOpcUaClientFactory.Create()` (`RealOpcUaClient.cs:325`), which calls
+`new RealOpcUaClient(_globalOptions)` and passes **no logger**. The factory itself
+holds only an `OpcUaGlobalOptions` and has no `ILoggerFactory`/`ILogger` available.
+As a result the `_logger` field is always `NullLogger` for every real OPC UA
+connection, and the man-in-the-middle warning the DCL-012 fix added is silently
+discarded. An operator who deploys a connection with `AutoAcceptUntrustedCerts`
+enabled — accepting any server certificate on an industrial control link — gets no
+visible signal anywhere in the logs. The in-scope half of DCL-012's resolution is
+therefore not actually effective in production; only the unit test
+(`DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse`, which only
+checks the default value) passes.
+
+**Recommendation**
+
+Thread a real logger through to `RealOpcUaClient`. `DataConnectionFactory` already
+holds an `ILoggerFactory` and constructs `RealOpcUaClientFactory(globalOptions)` —
+give `RealOpcUaClientFactory` an `ILoggerFactory` (or an `ILogger<RealOpcUaClient>`)
+constructor parameter and pass `_loggerFactory.CreateLogger<RealOpcUaClient>()` into
+each `new RealOpcUaClient(...)`. Add a test that asserts the warning is emitted on a
+real connect with auto-accept enabled (e.g. via a captured `ILogger`), not just that
+the default is `false`.
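
A minimal sketch of the logger wiring recommended above, assuming the `Microsoft.Extensions.Logging.Abstractions` package is referenced. The `OpcUaGlobalOptions` and `RealOpcUaClient` bodies here are simplified stand-ins, and `WarnIfAutoAccept` is a hypothetical placeholder for the warning path inside `ConnectAsync`; the real signatures may differ.

```csharp
#nullable enable
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Logging.Abstractions;

// Hypothetical stand-in for the real options type.
public sealed class OpcUaGlobalOptions { }

// Simplified stand-in for the real client: only the logger plumbing is modelled.
public sealed class RealOpcUaClient
{
    private readonly ILogger<RealOpcUaClient> _logger;

    public RealOpcUaClient(OpcUaGlobalOptions options, ILogger<RealOpcUaClient>? logger = null)
    {
        // The optional logger still falls back to NullLogger, as described in the finding.
        _logger = logger ?? NullLogger<RealOpcUaClient>.Instance;
    }

    // Hypothetical placeholder for the DCL-012 warning emitted during connect.
    public void WarnIfAutoAccept(bool autoAcceptUntrustedCerts)
    {
        if (autoAcceptUntrustedCerts)
        {
            _logger.LogWarning(
                "AutoAcceptUntrustedCerts is enabled; server certificates are not validated.");
        }
    }
}

public sealed class RealOpcUaClientFactory
{
    private readonly OpcUaGlobalOptions _globalOptions;
    private readonly ILoggerFactory _loggerFactory;

    // The factory now requires an ILoggerFactory, so production callers cannot omit it.
    public RealOpcUaClientFactory(OpcUaGlobalOptions globalOptions, ILoggerFactory loggerFactory)
    {
        _globalOptions = globalOptions;
        _loggerFactory = loggerFactory;
    }

    // Every client gets a real category logger instead of the NullLogger default.
    public RealOpcUaClient Create()
        => new RealOpcUaClient(_globalOptions, _loggerFactory.CreateLogger<RealOpcUaClient>());
}
```

Making the factory parameter mandatory (rather than optional, as on the client) is one way to keep the omission from recurring silently; whether that fits the existing DI registration is not confirmed by the review.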
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-015 — Initial-connect failures never trigger failover; an unreachable primary at startup never tries the backup
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:404-417`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:419-493` |
+
+**Description**
+
+Failover between the primary and backup endpoints is implemented in two places, both
+reachable only after a connection has already been `Connected` at least once:
+`HandleReconnectResult` (Reconnecting state) counts `_consecutiveFailures` and switches
+endpoint, and `BecomeReconnecting` counts `_consecutiveUnstableDisconnects`.
+
+`HandleConnectResult` — the handler for the *initial* connection attempt in the
+`Connecting` state (`DataConnectionActor.cs:404-417`) — does neither. On failure it
+only logs and re-arms the reconnect timer with `AttemptConnect`; it never increments
+`_consecutiveFailures`, never consults `_backupConfig`, and never switches endpoint.
+
+Consequence: if the primary endpoint is unreachable when the connection actor first
+starts — which is the common case after a fresh artifact deployment, a site restart,
+or a primary that is simply down at that moment — the actor retries the *primary*
+endpoint indefinitely at `ReconnectInterval` and **never** attempts the configured
+backup. The design doc's endpoint-redundancy promise ("automatic failover when the
+active endpoint becomes unreachable") is silently not honoured for the
+never-connected-yet case, and an operator sees a connection stuck `Connecting` forever
+despite a healthy backup being configured.
+
+**Recommendation**
+
+Make `HandleConnectResult` participate in the failover counter the same way
+`HandleReconnectResult` does: increment `_consecutiveFailures` on failure and, when
+`_backupConfig != null && _consecutiveFailures >= _failoverRetryCount`, perform the
+endpoint switch (dispose adapter, create the other adapter, bump `_adapterGeneration`,
+log the failover event) before re-arming the timer. Alternatively, fold the initial
+connect into the same reconnect path so there is a single failover decision point. Add
+a regression test for "primary down at startup, backup configured → fails over to
+backup".
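
The "single failover decision point" alternative can be sketched as a small framework-free helper that both `HandleConnectResult` and `HandleReconnectResult` would call. `FailoverTracker` and its members are hypothetical names for illustration; the actor's real fields (`_consecutiveFailures`, `_backupConfig`, `_failoverRetryCount`) are only mirrored here, and resetting the counter after a switch is an assumption about the intended behaviour.

```csharp
using System;

// Hypothetical helper: one place that decides when to switch endpoint.
public sealed class FailoverTracker
{
    private readonly bool _hasBackup;
    private readonly int _failoverRetryCount;
    private int _consecutiveFailures;

    // True while the backup endpoint is the active one.
    public bool UsingBackup { get; private set; }

    public FailoverTracker(bool hasBackup, int failoverRetryCount)
    {
        if (failoverRetryCount < 1)
            throw new ArgumentOutOfRangeException(nameof(failoverRetryCount));
        _hasBackup = hasBackup;
        _failoverRetryCount = failoverRetryCount;
    }

    // Call on every failed attempt, initial connect or reconnect alike.
    // Returns true when the caller should dispose the adapter and switch endpoint.
    public bool RecordFailure()
    {
        _consecutiveFailures++;
        if (!_hasBackup || _consecutiveFailures < _failoverRetryCount)
            return false;

        _consecutiveFailures = 0;   // assumed: start counting afresh on the new endpoint
        UsingBackup = !UsingBackup; // toggle between primary and backup
        return true;
    }

    // Call on any successful connect so earlier failures do not accumulate.
    public void RecordSuccess() => _consecutiveFailures = 0;
}
```

With this shape, the "primary down at startup" regression test reduces to calling `RecordFailure()` `failoverRetryCount` times on a fresh tracker and checking that the last call asks for a switch.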
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-016 — `HandleSubscribeCompleted` reports `SubscribeTagsResponse` success even on a connection-level subscribe failure
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:606,666-672`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:232-240` |
+
+**Description**
+
+`HandleSubscribeCompleted` computes `connectionLevelFailure` (line 606) and returns it
+so the `Connected`-state handler can drive the actor into `Reconnecting`
+(`DataConnectionActor.cs:232-240`). But before returning, it unconditionally replies
+to the caller with `new SubscribeTagsResponse(..., true, null, ...)` (lines 666-667) —
+`Success: true`, `Error: null` — regardless of whether any tag failed at connection
+level.
+
+So when a subscribe arrives while the adapter is silently down, the Instance Actor is
+told the subscribe **succeeded**, while the connection actor simultaneously transitions
+to `Reconnecting`. The tags were never actually subscribed at the adapter (the catch
+block recorded `Success: false`); they are recovered later by `ReSubscribeAll` only if
+and when reconnection succeeds. The caller has no way to distinguish "subscribed and
+healthy" from "accepted, but the connection is currently down" — a misleading
+success signal on a request that did not do what the response claims.
+
+(Genuine tag-resolution failures are arguably also reported as overall `true`, but
+that is defensible: those tags are tracked in `_unresolvedTags` and the design models
+unresolved tags as a runtime quality concern, with a `Bad`-quality `TagValueUpdate`
+already pushed. The connection-level case is the clear defect because the actor itself
+treats it as a failure worth a state transition.)
+
+**Recommendation**
+
+When `connectionLevelFailure` is true, reply with
+`SubscribeTagsResponse(..., success: false, error: "connection unavailable — will
+re-subscribe on reconnect", ...)` (or an equivalent), so the caller's response matches
+the actor's own assessment. Optionally carry per-tag outcomes in the response so the
+Instance Actor can reflect partial success. Add a test asserting the response is not
+`Success: true` when a connection-level subscribe failure drives `Reconnecting`.
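
As a rough illustration of the recommended reply logic: the real `SubscribeTagsResponse` constructor is elided in the review, so the sketch below uses a hypothetical, simplified `SubscribeReply` record and a hypothetical `SubscribeReplies.From` helper. Only the branching on `connectionLevelFailure` is the point.

```csharp
#nullable enable

// Hypothetical, simplified response shape; not the real SubscribeTagsResponse.
public sealed record SubscribeReply(string CorrelationId, bool Success, string? Error);

public static class SubscribeReplies
{
    // Builds the reply so that it agrees with the actor's own failure assessment.
    public static SubscribeReply From(string correlationId, bool connectionLevelFailure)
        => connectionLevelFailure
            ? new SubscribeReply(
                correlationId,
                Success: false,
                Error: "connection unavailable; will re-subscribe on reconnect")
            : new SubscribeReply(correlationId, Success: true, Error: null);
}
```

Tag-resolution failures are deliberately left on the success branch here, matching the review's view that unresolved tags are a runtime quality concern rather than a failed request.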
+
+**Resolution**
+
+_Unresolved._
+
+### DataConnectionLayer-017 — `WriteBatchAsync` aborts the whole batch on a mid-batch disconnect
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:229-237`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:218-227` |
+
+**Description**
+
+`WriteBatchAsync` loops calling `WriteAsync` per tag (`OpcUaDataConnection.cs:229-237`).
+`WriteAsync` returns a `WriteResult` for OPC-UA-level write rejections (good — a bad
+status does not abort the batch), but it first calls `EnsureConnected()`
+(`OpcUaDataConnection.cs:220`), which throws `InvalidOperationException` when the
+client is disconnected. `WriteBatchAsync` does not catch that exception, so if the
+connection drops partway through a batch the whole `WriteBatchAsync` throws and the
+caller gets no result map — losing the per-tag outcomes for the tags that already
+wrote. This is the same class of defect that DataConnectionLayer-007 fixed for
+`ReadBatchAsync` (which now records a failed `ReadResult` per failing tag and only
+propagates `OperationCanceledException`). `WriteBatchAsync` feeds
+`WriteBatchAndWaitAsync` (line 246), so a disconnect during a flag-and-wait write
+sequence surfaces as an unhandled exception rather than a clean `false`/per-tag result.
+
+Severity is Low because device writes are real-time control operations with no
+store-and-forward, the batch write paths are not on the primary `HandleWrite` hot path
+(`HandleWrite` calls single-tag `WriteAsync`), and a disconnect mid-batch is itself an
+error condition — but the inconsistent error shape (exception vs. per-tag result) is a
+maintainability and correctness wart.
+
+**Recommendation**
+
+Mirror the DCL-007 fix: wrap the per-tag `WriteAsync` call in `WriteBatchAsync` in a
+try/catch that records a failed `WriteResult(false, ex.Message)` for the failing tag
+and continues, while still propagating `OperationCanceledException` to abort a
+cancelled batch as a whole. This gives callers (including `WriteBatchAndWaitAsync`) a
+complete, consistent result map.
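
The per-tag try/catch described above might look roughly like the following. This is a sketch under assumptions: `WriteResult` is modelled as a two-field record matching the `WriteResult(false, ex.Message)` usage in the recommendation, and the single-tag write is passed in as a delegate (`writeAsync`) so the example stays independent of the real adapter; `BatchWriter` is a hypothetical name.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Assumed shape, inferred from WriteResult(false, ex.Message) in the recommendation.
public sealed record WriteResult(bool Success, string? ErrorMessage);

public static class BatchWriter
{
    public static async Task<IReadOnlyDictionary<string, WriteResult>> WriteBatchAsync(
        IReadOnlyDictionary<string, object?> values,
        Func<string, object?, CancellationToken, Task<WriteResult>> writeAsync,
        CancellationToken cancellationToken = default)
    {
        var results = new Dictionary<string, WriteResult>(values.Count);

        foreach (var pair in values)
        {
            try
            {
                results[pair.Key] = await writeAsync(pair.Key, pair.Value, cancellationToken)
                    .ConfigureAwait(false);
            }
            catch (OperationCanceledException)
            {
                // Cancellation still aborts the batch as a whole.
                throw;
            }
            catch (Exception ex)
            {
                // Any other failure (e.g. a disconnect) is recorded for this tag only,
                // and the loop carries on with the remaining tags.
                results[pair.Key] = new WriteResult(false, ex.Message);
            }
        }

        return results;
    }
}
```

Once the connection has dropped, every remaining tag will likely fail the same way, so the result map fills with per-tag failures instead of the call throwing; callers such as `WriteBatchAndWaitAsync` can then treat any failed entry as an overall `false`.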
+
+**Resolution**
+
+_Unresolved._