docs(code-reviews): re-review batch 2 at 39d737e — ConfigurationDatabase, DataConnectionLayer, DeploymentManager, ExternalSystemGateway, HealthMonitoring

17 new findings: ConfigurationDatabase-012..014, DataConnectionLayer-014..017, DeploymentManager-015..017, ExternalSystemGateway-015..017, HealthMonitoring-013..016.
This commit is contained in:
Joseph Doherty
2026-05-17 00:45:10 -04:00
parent e49846603e
commit 89636e2bbf
6 changed files with 895 additions and 64 deletions

View File

@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.ExternalSystemGateway` |
| Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 3 |
## Summary
@@ -30,20 +30,41 @@ Test coverage is thin — `CachedCall` transient/buffering paths and `DatabaseGa
are entirely untested. Themes: incomplete wiring against the S&F engine, and design-doc
requirements (timeout, retry settings) that are declared but not implemented.
#### Re-review 2026-05-17 (commit `39d737e`)
All fourteen prior findings remain `Resolved`; the resolutions for findings 001014
were spot-checked against the current source and hold. The re-review walked the full
10-category checklist again and surfaced **three new findings**. The most serious
(`ExternalSystemGateway-015`, High) is a regression *introduced by* the
`ExternalSystemGateway-004` resolution: `CachedCall`/`CachedWrite` now pass a
per-system/per-connection `MaxRetries` of `0` through verbatim, but
`StoreAndForwardService.RetryMessageAsync` interprets a stored `MaxRetries` of `0` as
**retry forever**, not "never retry" — so the very `0` the ESG-004 fix claims to
"honour as never retry" actually produces an unbounded retry loop, and two ESG tests
assert the broken behaviour. `ExternalSystemGateway-016` (Medium) is that the
`ExternalSystemGateway-013` resolution used `ConfigureHttpClientDefaults`, which is a
**process-global** registration — it forces a `SocketsHttpHandler` (capped at the ESG
option) onto every `HttpClient` in the host, including the Notification Service's
OAuth2 token client, not just the gateway's per-system clients. `ExternalSystemGateway-017`
(Low) is a trailing-`?` URL nit when a GET method's parameters are all null. Theme:
both substantive findings are second-order defects in earlier fixes — the earlier
resolutions did not verify the downstream contract of the S&F engine they integrate
with.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | URL building edge cases, dropped S&F result, classification gaps — findings 003, 006, 009. |
| 2 | Akka.NET conventions | ☑ | No actors in this module; `AddExternalSystemGatewayActors` is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found in this module. |
| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; `ExternalCallResult.Response` lazy-parse is not thread-safe but instances are single-use. No findings raised. |
| 4 | Error handling & resilience | ☑ | S&F handler never registered, double-dispatch, timeout not applied, cancellation conflation — findings 001, 002, 003, 008. |
| 5 | Security | ☑ | Auth secrets logged-safe, but error bodies echoed verbatim — finding 007. |
| 6 | Performance & resource management | ☑ | `HttpRequestMessage`/`HttpResponseMessage` and failed `SqlConnection` not disposed; full repository scan per call — findings 005, 010, 011. |
| 7 | Design-document adherence | ☑ | Timeout, retry settings, audit logging gaps — findings 002, 004, 012. |
| 8 | Code organization & conventions | ☑ | Options class correctly owned by module; `MaxConcurrentConnectionsPerSystem` unused — finding 013. |
| 9 | Testing coverage | ☑ | CachedCall buffering and DatabaseGateway untested — finding 014. |
| 10 | Documentation & comments | ☑ | XML docs reference WP numbers; permanent-failure logging requirement unverified — folded into finding 012. |
| 1 | Correctness & logic bugs | ☑ | Prior: URL edge cases, dropped S&F result, classification — 003, 006, 009. Re-review: `MaxRetries == 0` retry-forever vs never-retry contradiction (015); trailing-`?` URL nit (017). |
| 2 | Akka.NET conventions | ☑ | No actors in this module; `AddExternalSystemGatewayActors` is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found. |
| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; the S&F delivery handlers resolve in a fresh DI scope on the sweep thread. No findings raised. |
| 4 | Error handling & resilience | ☑ | Prior: handler registration, double-dispatch, timeout, cancellation — 001, 002, 003, 008. Re-review: the unbounded-retry consequence of finding 015 is also a resilience defect (recorded under category 1). |
| 5 | Security | ☑ | Error bodies now truncated (007). No new issues — auth secrets not logged, body capped. |
| 6 | Performance & resource management | ☑ | Prior: disposal and repository-scan findings 005, 010, 011 — all resolved and verified. No new issues. |
| 7 | Design-document adherence | ☑ | Prior: timeout, retry settings, logging — 002, 004, 012. Re-review: finding 015 is also a design-adherence gap (S&F retry contract); recorded under category 1. |
| 8 | Code organization & conventions | ☑ | Prior: `MaxConcurrentConnectionsPerSystem` wiring — 013. Re-review: that wiring uses process-global `ConfigureHttpClientDefaults`, leaking the ESG cap onto every host `HttpClient` — finding 016. |
| 9 | Testing coverage | ☑ | Coverage is broad after finding 014. Re-review note: the `ZeroMaxRetries...` tests assert the persisted column, not the sweep outcome, and so lock in the finding-015 defect. |
| 10 | Documentation & comments | ☑ | Inline comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101` assert a "never retry" semantic that the code does not deliver — see finding 015. |
## Findings
@@ -760,3 +781,144 @@ headers and body so URL/auth/body construction is now verified, not just status
These are new-coverage tests against already-correct behaviour, so they pass on the
current source; the `BuildUrl` and `ApplyAuth` paths they exercise are now protected
against regression.
### ExternalSystemGateway-015 — `MaxRetries == 0` is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:120-127`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:102-108` |
**Description**
The `ExternalSystemGateway-004` fix removed the `MaxRetries > 0 ? ... : null` guard so
that `CachedCallAsync` and `CachedWriteAsync` now pass the definition's `MaxRetries`
to `StoreAndForwardService.EnqueueAsync` verbatim. The stated rationale (and the inline
comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101`, plus the
tests `CachedCall_TransientFailure_ZeroMaxRetriesIsHonouredNotTreatedAsUnset` and
`CachedWrite_ZeroMaxRetriesIsHonouredNotTreatedAsUnset`) is that a configured
`MaxRetries` of `0` means **"never retry"**.
That is the opposite of what the Store-and-Forward engine actually does with the
value. `EnqueueAsync` stores the passed `maxRetries` directly into
`StoreAndForwardMessage.MaxRetries` (`StoreAndForwardService.cs:139`), whose own XML
doc states **"Maximum retry-sweep attempts before parking (0 = no limit)"**
(`StoreAndForwardMessage.cs:30`). The retry sweep enforces it as
`if (message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries)`
(`StoreAndForwardService.cs:285`) — when `MaxRetries == 0` that guard is false on
every sweep, so the message is **never parked and is retried forever**.
Consequences for a `CachedCall`/`CachedWrite` against a system or connection
configured with `MaxRetries = 0`:
1. A transiently-failing message that the operator intended never to retry is instead
retried on every sweep indefinitely, accumulating in the buffer and repeatedly
re-dispatching the request — the exact unbounded-retry / duplicate-delivery hazard
the idempotency note in the design doc warns about.
2. The two ESG regression tests cited above assert `max_retries == 0` is *stored* and
describe that as "honoured" — they verify the persisted column, never the resulting
sweep behaviour, so they lock in the broken semantics.
3. Because the SiteRuntime repository still always supplies `MaxRetries == 0` (the
open companion gap noted in `ExternalSystemGateway-004`), **every** cached call and
cached write at every site currently buffers as retry-forever. Before the ESG-004
fix the `> 0` guard sent `null`, so the S&F `DefaultMaxRetries` (a bounded value)
applied — i.e. the ESG-004 fix turned a bounded retry into an unbounded one for the
common path.
**Recommendation**
Reconcile the ESG and S&F interpretations of `MaxRetries == 0` — they must agree.
Either: (a) treat the entity's `MaxRetries == 0` as "unset" and pass `null` so the
bounded S&F default applies (reverting to the pre-ESG-004 behaviour, and accepting
that "never retry" then needs a different representation such as a nullable field or a
`MaxRetries == 1` convention); or (b) if `0` genuinely must mean "never retry", add an
explicit no-retry path — e.g. do not enqueue at all on transient failure when
`MaxRetries == 0`, or introduce a distinct sentinel — and fix the
`StoreAndForwardMessage.MaxRetries` doc and `RetryMessageAsync` guard so `0` no longer
means "no limit". Update the two `ZeroMaxRetries...` tests to assert the *sweep*
outcome (parked / not retried), not just the stored column value.
**Resolution**
_Unresolved._
### ExternalSystemGateway-016 — `ConfigureHttpClientDefaults` applies the ESG connection cap to every `HttpClient` in the host process
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:21-29` |
**Description**
The `ExternalSystemGateway-013` fix wires `MaxConcurrentConnectionsPerSystem` by
calling `services.ConfigureHttpClientDefaults(builder => builder.ConfigurePrimaryHttpMessageHandler(...))`.
The inline comment claims this "applies to the dynamically-named clients created by
`ExternalSystemClient`" — but `ConfigureHttpClientDefaults` is **process-global**: it
adds the configuration to *every* `HttpClient`/`IHttpClientFactory` client created
anywhere in the host, regardless of name.
The Host registers the External System Gateway alongside other components that also
use `IHttpClientFactory` — notably `ScadaLink.NotificationService` (`OAuth2TokenService`
and its `ServiceCollectionExtensions` call `AddHttpClient`). With the ESG registration
present, the OAuth2 token client (and any future `HttpClient` consumer in the host)
has its **primary handler replaced** by a `SocketsHttpHandler` whose
`MaxConnectionsPerServer` is the ESG's `MaxConcurrentConnectionsPerSystem`. That:
1. Silently caps unrelated clients at a value owned by, and named for, a different
component — an operator tuning the ESG option unknowingly throttles Microsoft 365
OAuth2 token acquisition.
2. Overrides/discards any primary-handler configuration those other components add for
their own clients (e.g. a custom handler, proxy, or certificate settings).
This is a leaky, surprising side effect for what the option claims to be a per-ESG
setting; `ConfigureHttpClientDefaults` should not be used to express a single
component's policy.
**Recommendation**
Scope the handler configuration to the gateway's own clients only. The ESG already
creates per-system clients with a deterministic name pattern
(`ExternalSystem_{system.Name}`); register a typed/named `HttpClient` (or a small
factory abstraction) for that pattern and call `ConfigurePrimaryHttpMessageHandler`
on that registration instead of on the global defaults. If a name-pattern registration
is impractical, document the global effect explicitly and rename the option, but the
preferred fix is to stop using `ConfigureHttpClientDefaults`.
**Resolution**
_Unresolved._
### ExternalSystemGateway-017 — `BuildUrl` appends a bare trailing `?` when a GET method's parameters are all null
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:324-333` |
**Description**
In `BuildUrl`, the GET/DELETE query-string branch is entered when
`parameters != null && parameters.Count > 0`, but the projection then filters out
null-valued entries (`parameters.Where(p => p.Value != null)`). When a GET/DELETE
method is invoked with a non-empty parameter dictionary whose values are *all* null,
`queryString` is the empty string and the code still executes `url += "?" + queryString`,
producing a URL ending in a bare `?` (e.g. `https://host/api/resource?`). Most servers
tolerate a trailing `?`, but it is an unintended artifact, can defeat response caching
on some endpoints, and makes captured request URLs harder to read in logs.
**Recommendation**
Only append `"?" + queryString` when `queryString` is non-empty (compute the joined
string first and check it), so a method whose effective parameter set is empty
produces a clean URL identical to the no-parameters case.
**Resolution**
_Unresolved._