954 lines
54 KiB
Markdown
954 lines
54 KiB
Markdown
# Code Review — ExternalSystemGateway
|
||
|
||
| Field | Value |
|
||
|-------|-------|
|
||
| Module | `src/ScadaLink.ExternalSystemGateway` |
|
||
| Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
|
||
| Status | Reviewed |
|
||
| Last reviewed | 2026-05-17 |
|
||
| Reviewer | claude-agent |
|
||
| Commit reviewed | `39d737e` |
|
||
| Open findings | 0 |
|
||
|
||
## Summary
|
||
|
||
The External System Gateway is a small module (five source files plus options) that
|
||
implements the HTTP/REST client (`ExternalSystemClient`), the database access surface
|
||
(`DatabaseGateway`), and error classification (`ErrorClassifier`). The structure is
|
||
clean and the dual call-mode semantics broadly match the design doc. However, the
|
||
review surfaced several substantive problems that prevent the module from behaving as
|
||
designed. The most serious is that **no store-and-forward delivery handler is ever
|
||
registered** for the `ExternalSystem` or `CachedDbWrite` categories, so cached calls
|
||
and cached writes are buffered but can never actually be delivered on retry — a silent
|
||
data-loss path. Two further high-impact issues are that the **per-system call timeout
|
||
is never applied** to the HTTP client (the design's central error-handling guarantee
|
||
is absent), and that **`CachedCall` double-dispatches the HTTP request** because
|
||
`StoreAndForwardService.EnqueueAsync` itself re-attempts immediate delivery, breaking
|
||
the idempotency expectations. A cluster of medium issues concern resource leaks,
|
||
classification gaps (cancellation conflation), and the dropped `StoreAndForwardResult`.
|
||
Test coverage is thin — `CachedCall` transient/buffering paths and `DatabaseGateway`
|
||
are entirely untested. Themes: incomplete wiring against the S&F engine, and design-doc
|
||
requirements (timeout, retry settings) that are declared but not implemented.
|
||
|
||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||
|
||
All fourteen prior findings remain `Resolved`; the resolutions for findings 001–014
|
||
were spot-checked against the current source and hold. The re-review walked the full
|
||
10-category checklist again and surfaced **three new findings**. The most serious
|
||
(`ExternalSystemGateway-015`, High) is a regression *introduced by* the
|
||
`ExternalSystemGateway-004` resolution: `CachedCall`/`CachedWrite` now pass a
|
||
per-system/per-connection `MaxRetries` of `0` through verbatim, but
|
||
`StoreAndForwardService.RetryMessageAsync` interprets a stored `MaxRetries` of `0` as
|
||
**retry forever**, not "never retry" — so the very `0` the ESG-004 fix claims to
|
||
"honour as never retry" actually produces an unbounded retry loop, and two ESG tests
|
||
assert the broken behaviour. `ExternalSystemGateway-016` (Medium) is that the
|
||
`ExternalSystemGateway-013` resolution used `ConfigureHttpClientDefaults`, which is a
|
||
**process-global** registration — it forces a `SocketsHttpHandler` (capped at the ESG
|
||
option) onto every `HttpClient` in the host, including the Notification Service's
|
||
OAuth2 token client, not just the gateway's per-system clients. `ExternalSystemGateway-017`
|
||
(Low) is a trailing-`?` URL nit when a GET method's parameters are all null. Theme:
|
||
both substantive findings are second-order defects in earlier fixes — the earlier
|
||
resolutions did not verify the downstream contract of the S&F engine they integrate
|
||
with.
|
||
|
||
## Checklist coverage
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | ☑ | Prior: URL edge cases, dropped S&F result, classification — 003, 006, 009. Re-review: `MaxRetries == 0` retry-forever vs never-retry contradiction (015); trailing-`?` URL nit (017). |
|
||
| 2 | Akka.NET conventions | ☑ | No actors in this module; `AddExternalSystemGatewayActors` is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found. |
|
||
| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; the S&F delivery handlers resolve in a fresh DI scope on the sweep thread. No findings raised. |
|
||
| 4 | Error handling & resilience | ☑ | Prior: handler registration, double-dispatch, timeout, cancellation — 001, 002, 003, 008. Re-review: the unbounded-retry consequence of finding 015 is also a resilience defect (recorded under category 1). |
|
||
| 5 | Security | ☑ | Error bodies now truncated (007). No new issues — auth secrets not logged, body capped. |
|
||
| 6 | Performance & resource management | ☑ | Prior: disposal and repository-scan findings 005, 010, 011 — all resolved and verified. No new issues. |
|
||
| 7 | Design-document adherence | ☑ | Prior: timeout, retry settings, logging — 002, 004, 012. Re-review: finding 015 is also a design-adherence gap (S&F retry contract); recorded under category 1. |
|
||
| 8 | Code organization & conventions | ☑ | Prior: `MaxConcurrentConnectionsPerSystem` wiring — 013. Re-review: that wiring uses process-global `ConfigureHttpClientDefaults`, leaking the ESG cap onto every host `HttpClient` — finding 016. |
|
||
| 9 | Testing coverage | ☑ | Coverage is broad after finding 014. Re-review note: the `ZeroMaxRetries...` tests assert the persisted column, not the sweep outcome, and so lock in the finding-015 defect. |
|
||
| 10 | Documentation & comments | ☑ | Inline comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101` assert a "never retry" semantic that the code does not deliver — see finding 015. |
|
||
|
||
## Findings
|
||
|
||
### ExternalSystemGateway-001 — No S&F delivery handler registered; cached calls and writes can never be delivered
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Critical |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:81` |
|
||
|
||
**Description**
|
||
|
||
`CachedCallAsync` and `CachedWriteAsync` enqueue messages under
|
||
`StoreAndForwardCategory.ExternalSystem` and `StoreAndForwardCategory.CachedDbWrite`.
|
||
`StoreAndForwardService.RegisterDeliveryHandler` is the only mechanism that lets the
|
||
S&F engine actually deliver a buffered message, and a repository-wide search shows it
|
||
is **never called for either category** anywhere in the codebase. Consequences:
|
||
|
||
1. On a transient failure, `EnqueueAsync` falls through to the "No handler registered
|
||
— buffer for later" branch (`StoreAndForwardService.cs:163`) and the message is
|
||
persisted.
|
||
2. During the retry sweep, `AttemptDeliveryAsync` (`StoreAndForwardService.cs:201`)
|
||
logs `"No delivery handler for category {Category}"` and returns without ever
|
||
removing or delivering the message.
|
||
|
||
The result is that every cached external call and cached DB write is silently
|
||
buffered forever and never delivered — a data-loss path for the exact "deferred
|
||
delivery is acceptable" use cases the design doc calls out (posting production data,
|
||
quality reports). The script also receives `WasBuffered: true` / a successful
|
||
`CachedWriteAsync` completion, so the failure is completely invisible.
|
||
|
||
**Recommendation**
|
||
|
||
Register delivery handlers for `StoreAndForwardCategory.ExternalSystem` and
|
||
`StoreAndForwardCategory.CachedDbWrite` during host/site startup. The `ExternalSystem`
|
||
handler should deserialize the payload, re-resolve the system/method, and re-invoke
|
||
`InvokeHttpAsync`, returning `true`/`false`/throwing per the transient-vs-permanent
|
||
contract `EnqueueAsync` expects. The `CachedDbWrite` handler should execute the SQL
|
||
against the named connection. Add an integration test that buffers a message and
|
||
verifies it is delivered by a retry sweep.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16. Delivery handlers for `StoreAndForwardCategory.ExternalSystem` and
|
||
`CachedDbWrite` are now registered at site startup in `AkkaHostedService`, after
|
||
`StoreAndForwardService.StartAsync()`. Each handler resolves its consumer in a fresh DI
|
||
scope and calls a new `DeliverBufferedAsync`: `ExternalSystemClient.DeliverBufferedAsync`
|
||
re-resolves the system/method and re-invokes `InvokeHttpAsync`, and
|
||
`DatabaseGateway.DeliverBufferedAsync` executes the buffered SQL — each returning `true`
|
||
on success, `false` (park) when the target no longer exists or fails permanently, and
|
||
throwing on transient failure so the engine retries. `EnqueueAsync` gained an
|
||
`attemptImmediateDelivery` parameter; `CachedCallAsync` passes `false` so registering the
|
||
handler does not dispatch the request twice (the double-dispatch noted in
|
||
`ExternalSystemGateway-003`). Regression tests cover the success, target-removed and
|
||
transient-retry paths. Fixed by the commit whose message references
|
||
`ExternalSystemGateway-001`.
|
||
|
||
### ExternalSystemGateway-002 — Per-system call timeout is never applied to HTTP requests
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:130`, `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13` |
|
||
|
||
**Description**
|
||
|
||
The design doc states each external system definition specifies a timeout that
|
||
"applies to all method calls on that system" and "applies to the HTTP request
|
||
round-trip", and `ExternalSystemGatewayOptions.DefaultHttpTimeout` exists as a
|
||
fallback. In practice no timeout is ever configured. `ServiceCollectionExtensions`
|
||
calls `services.AddHttpClient()` with no per-named-client configuration, and
|
||
`InvokeHttpAsync` calls `_httpClientFactory.CreateClient($"ExternalSystem_{system.Name}")`
|
||
without setting `client.Timeout` or passing a `CancellationToken` derived from a
|
||
timeout. `SendAsync` is therefore subject only to `HttpClient`'s default 100-second
|
||
timeout, regardless of the system definition or the configured `DefaultHttpTimeout`.
|
||
A slow or hung external system will block the calling Script Execution Actor far
|
||
longer than the operator configured, and the design's core error-handling guarantee
|
||
(timeout → transient classification) does not hold within the intended window.
|
||
|
||
There is also no `Timeout` field on `ExternalSystemDefinition` at all, so even a
|
||
correct implementation has nowhere to read the per-system value from — the entity is
|
||
missing the field the design requires.
|
||
|
||
**Recommendation**
|
||
|
||
Add a `Timeout` (TimeSpan) field to `ExternalSystemDefinition` and have
|
||
`InvokeHttpAsync` enforce it — either by setting `client.Timeout` via a typed/named
|
||
`HttpClient` registration, or by linking a `CancellationTokenSource` with the
|
||
per-system (or `DefaultHttpTimeout`) timeout to the supplied `cancellationToken`
|
||
before `SendAsync`. Ensure the resulting `TaskCanceledException`/`TimeoutException`
|
||
is classified as transient.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `<pending>`). `InvokeHttpAsync` now enforces a call
|
||
timeout: `ExternalSystemClient` takes an `IOptions<ExternalSystemGatewayOptions>` and
|
||
links a `CancellationTokenSource(DefaultHttpTimeout)` with the caller's token before
|
||
`SendAsync` and the response-body read, so the design's "timeout applies to the HTTP
|
||
request round-trip" guarantee now holds within the configured window (default 30s)
|
||
instead of `HttpClient`'s default 100s. A timeout is reclassified as a
|
||
`TransientExternalSystemException`; a caller-initiated cancellation is distinguished
|
||
from a timeout and propagated as `OperationCanceledException` rather than being
|
||
swallowed as transient. Regression tests:
|
||
`Call_SlowSystem_TimesOutAsTransientErrorWithinConfiguredWindow` and
|
||
`Call_CallerCancellation_IsNotMisreportedAsTimeout`.
|
||
|
||
Note (partial scope): the per-*system* `Timeout` field on `ExternalSystemDefinition`
|
||
remains unimplemented — adding it requires a change to `ScadaLink.Commons`, which is
|
||
outside this module's edit scope. Until that entity field exists, the configured
|
||
`DefaultHttpTimeout` is the effective per-call limit for every system. A follow-up
|
||
against the Commons module should add the `Timeout` field and have `InvokeHttpAsync`
|
||
prefer it over the default. This is a tracked follow-up, not a regression.
|
||
|
||
### ExternalSystemGateway-003 — `CachedCall` double-dispatches the HTTP request
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:84-117` |
|
||
|
||
**Description**
|
||
|
||
`CachedCallAsync` first calls `InvokeHttpAsync` directly (line 86). On a
|
||
`TransientExternalSystemException` it then calls `_storeAndForward.EnqueueAsync(...)`
|
||
(line 109). `StoreAndForwardService.EnqueueAsync` is **not** a pure enqueue — it
|
||
"Attempts immediate delivery" by invoking the registered delivery handler
|
||
(`StoreAndForwardService.cs:128-159`). If a delivery handler for the `ExternalSystem`
|
||
category is registered (as finding 001 recommends), the HTTP request will be executed
|
||
a **second time** synchronously inside `EnqueueAsync`, immediately after the first
|
||
attempt failed. For a transient failure that is actually a slow/overloaded system,
|
||
this doubles the load and — critically — if the original request did reach the
|
||
external system, the immediate retry produces a duplicate delivery before the script
|
||
even returns, worsening the idempotency hazard the design doc explicitly warns about.
|
||
|
||
**Recommendation**
|
||
|
||
Decide on one dispatch path. Either (a) have `CachedCall` not pre-invoke
|
||
`InvokeHttpAsync` and instead let `EnqueueAsync`'s immediate-delivery attempt be the
|
||
single first attempt (requires the handler to exist and to surface permanent vs
|
||
transient correctly); or (b) add an enqueue-only entry point to
|
||
`StoreAndForwardService` that skips the immediate-delivery attempt, and have
|
||
`CachedCall` use it after its own first attempt. Approach (a) is cleaner and removes
|
||
the duplicated logic.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `<pending>`). Re-triage: this finding was already fixed in
|
||
the codebase as a side effect of the `ExternalSystemGateway-001` fix and is no longer
|
||
reproducible against the current source. `StoreAndForwardService.EnqueueAsync` gained an
|
||
`attemptImmediateDelivery` parameter (recommendation approach (b)), and
|
||
`CachedCallAsync` passes `attemptImmediateDelivery: false` after its own first HTTP
|
||
attempt — so `EnqueueAsync` buffers the message for the background retry sweep without
|
||
re-invoking the registered delivery handler, eliminating the duplicate dispatch. A
|
||
dedicated regression test, `CachedCall_TransientFailure_DoesNotImmediatelyRedispatchViaRegisteredHandler`,
|
||
was added in this module's test suite: it registers a counting delivery handler, drives
|
||
a `CachedCall` whose HTTP attempt fails transiently, and asserts the handler is invoked
|
||
zero times during enqueue. The test was verified to fail if `attemptImmediateDelivery`
|
||
is flipped back to `true`.
|
||
|
||
### ExternalSystemGateway-004 — System retry settings are not honoured for cached calls/writes
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:114-115`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:86-87` |
|
||
|
||
**Description**
|
||
|
||
`CachedCallAsync` and `CachedWriteAsync` pass the definition's `MaxRetries` /
|
||
`RetryDelay` to `EnqueueAsync` only when they are non-default
|
||
(`MaxRetries > 0 ? ... : null`, `RetryDelay > TimeSpan.Zero ? ... : null`), otherwise
|
||
falling back to the S&F defaults. The site-side repository that supplies these
|
||
definitions, `SiteExternalSystemRepository.MapExternalSystem`
|
||
(`src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs:194`), never
|
||
reads `MaxRetries`/`RetryDelay` from SQLite at all — the constructed entities always
|
||
have `MaxRetries == 0` and `RetryDelay == TimeSpan.Zero`. As a result, at sites the
|
||
per-system retry settings the design doc requires are *always* discarded and the
|
||
global S&F defaults are silently used instead. The `> 0` guard in the ESG also makes
|
||
a legitimately-configured `MaxRetries` of 0 ("never retry") indistinguishable from
|
||
"unset", so an operator cannot express "do not retry".
|
||
|
||
**Recommendation**
|
||
|
||
Within this module, drop the `> 0` / `> Zero` guards and pass the definition values
|
||
through directly (or use nullable fields on the entity to distinguish "unset"). The
|
||
companion fix in `SiteExternalSystemRepository` to actually map the retry columns
|
||
should be tracked against the SiteRuntime module.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). `CachedCallAsync` and `CachedWriteAsync` now pass
|
||
the definition's `MaxRetries` to `EnqueueAsync` verbatim — the `> 0` guard is dropped, so
|
||
a legitimately-configured `MaxRetries` of 0 ("never retry") is honoured instead of being
|
||
collapsed to the S&F default. The `RetryDelay > TimeSpan.Zero` guard is deliberately
|
||
**kept**: `TimeSpan.Zero` is the entity default for an unconfigured field and a literal
|
||
zero-delay retry loop is not a valid configuration, so falling back to the S&F default
|
||
interval for an unset delay is correct (only `MaxRetries == 0` is a meaningful operator
|
||
choice). Regression test `CachedCall_TransientFailure_ZeroMaxRetriesIsHonouredNotTreatedAsUnset`
|
||
buffers a transient failure and asserts the buffered message carries `MaxRetries == 0`
|
||
rather than the S&F default; `CachedCall_TransientFailure_BuffersWithSystemRetrySettings`
|
||
additionally covers a non-default settings pass-through. The companion fix in
|
||
`SiteExternalSystemRepository.MapExternalSystem` to actually read the `MaxRetries` /
|
||
`RetryDelay` columns from SQLite remains a tracked follow-up against the SiteRuntime
|
||
module (outside this module's edit scope) — until then, sites still supply
|
||
`MaxRetries == 0`, which this fix now correctly honours as "never retry".
|
||
|
||
### ExternalSystemGateway-005 — `HttpRequestMessage` and `HttpResponseMessage` are not disposed
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Performance & resource management |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:133-167` |
|
||
|
||
**Description**
|
||
|
||
`InvokeHttpAsync` creates an `HttpRequestMessage` (line 133) and receives an
|
||
`HttpResponseMessage` from `SendAsync` (line 155); neither is wrapped in a `using` nor
|
||
explicitly disposed. Both are `IDisposable` and own resources (the request's
|
||
`StringContent`, the response's content stream). Under the per-invocation call volume
|
||
of a busy site this produces avoidable pressure on the finalizer queue and can hold
|
||
socket/stream resources longer than necessary. The success path reads the content but
|
||
never disposes the response; the error path likewise reads `errorBody` and then throws
|
||
without disposing.
|
||
|
||
**Recommendation**
|
||
|
||
Wrap the request in `using var request = ...` and the response in
|
||
`using var response = ...` (or call `Dispose()` in a `finally`). Ensure disposal still
|
||
occurs on the exception paths.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). `InvokeHttpAsync` now declares the request as
|
||
`using var request` and wraps all response handling in a `using (response)` block, so
|
||
both `IDisposable` instances (and the request's `StringContent` / the response content
|
||
stream) are released on the success path **and** on the permanent/transient
|
||
exception paths. Regression tests `Call_SuccessfulHttp_DisposesRequestAndResponse` and
|
||
`Call_PermanentFailure_StillDisposesRequestAndResponse` use a disposal-tracking
|
||
`HttpMessageHandler`/`HttpContent` and assert both the request and the response content
|
||
are disposed; both were verified to fail before the `using` wrappers were added.
|
||
|
||
### ExternalSystemGateway-006 — `BuildUrl` ignores path templates and appends a trailing slash for empty paths
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium — partially re-triaged: trailing-slash bug fixed; path-templating sub-issue is a design decision (see Resolution) |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:180-196` |
|
||
|
||
**Description**
|
||
|
||
`BuildUrl` does `baseUrl.TrimEnd('/') + "/" + path.TrimStart('/')`. When `method.Path`
|
||
is empty (a method that targets the base URL itself), this still appends a `/`,
|
||
producing `https://host/api/` which some servers treat as a different resource than
|
||
`https://host/api`. More importantly, the design doc shows method paths as templates
|
||
like `/recipes/{id}`, but `BuildUrl` performs no placeholder substitution — a `{id}`
|
||
token is sent literally in the URL and the corresponding parameter is instead appended
|
||
as a query-string entry (for GET/DELETE) or placed in the JSON body (POST/PUT). Either
|
||
the design's path-template feature is unimplemented, or the doc is stale; in the
|
||
current code a method defined as `/recipes/{id}` will never produce a correct URL.
|
||
|
||
**Recommendation**
|
||
|
||
Decide whether path templating is in scope. If yes, implement `{name}` substitution
|
||
from `parameters` in `BuildUrl` and exclude substituted parameters from the query
|
||
string/body. If no, update the component design doc to remove the `/recipes/{id}`
|
||
example and state that paths are literal. Also avoid appending a trailing `/` when
|
||
`path` is empty.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). The **trailing-slash bug** is fixed: `BuildUrl`
|
||
now appends a `/`-joined path segment only when the method's path is non-empty after
|
||
trimming, so a method targeting the base URL itself produces `https://host/api` rather
|
||
than `https://host/api/`. Regression tests `Call_MethodWithEmptyPath_DoesNotAppendTrailingSlash`
|
||
and `Call_MethodWithPath_BuildsExpectedUrl` (asserting on the captured request URI)
|
||
cover the empty-path and normal-path cases; the empty-path test was verified to fail
|
||
before the fix.
|
||
|
||
Re-triage of the **path-templating sub-issue** (`{id}` placeholder substitution): this
|
||
is a genuine design decision, not a code bug, and it requires editing the component
|
||
design doc — both outside this module's edit scope (`src/`, `tests/`, this file only).
|
||
The current code treats method paths as literal strings and routes parameters to the
|
||
query string (GET/DELETE) or JSON body (POST/PUT); a method authored as `/recipes/{id}`
|
||
sends the `{id}` token verbatim. **Tracked follow-up / surfaced design question:** the
|
||
design owner must decide whether path templating is in scope — if yes, implement
|
||
`{name}` substitution in `BuildUrl` and exclude substituted params from the
|
||
query/body; if no, the `Component-ExternalSystemGateway.md` `/recipes/{id}` example must
|
||
be changed to a literal path. The trailing-slash defect (the concrete correctness bug
|
||
in this finding) is fully resolved.
|
||
|
||
### ExternalSystemGateway-007 — External error response bodies are echoed verbatim into script-visible error messages
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Security |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:167-177` |
|
||
|
||
**Description**
|
||
|
||
On a non-success HTTP response, the full response body is read into `errorBody` and
|
||
embedded verbatim into the exception message (`$"HTTP {code} from {name}: {errorBody}"`),
|
||
which then flows into `ExternalCallResult.ErrorMessage` and back to the calling script,
|
||
and into Site Event Logging. An external system error page can be arbitrarily large
|
||
(an HTML stack trace, a multi-megabyte body) and may contain sensitive detail. There
|
||
is no size cap, so a hostile or misbehaving endpoint can inflate every error log entry
|
||
and error string returned to scripts. There is also no content-type check before
|
||
treating the body as text.
|
||
|
||
**Recommendation**
|
||
|
||
Truncate `errorBody` to a bounded length (e.g. 1–2 KB) before embedding it, and
|
||
consider logging the full body separately at debug level rather than returning it to
|
||
the script. Optionally only include the body when the content type is textual.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). `InvokeHttpAsync` now truncates the external error
|
||
response body to `MaxErrorBodyChars` (2048) via a `Truncate` helper before embedding it
|
||
into the transient/permanent exception message — so a misbehaving or hostile endpoint
|
||
can no longer inflate every script-visible `ErrorMessage` and Site Event Logging entry
|
||
with a multi-megabyte body. When truncation occurs the message is suffixed with
|
||
`… [truncated, N chars total]` so the original size is still visible. Regression test
|
||
`Call_PermanentFailureWithHugeErrorBody_TruncatesErrorMessage` drives a 400 with a
|
||
500 KB body and asserts the resulting `ErrorMessage` is bounded (< 4096 chars); it was
|
||
verified to fail (500 040-char message) before the cap was added. Content-type
|
||
filtering was considered optional in the recommendation and was not implemented — the
|
||
size cap alone closes the inflation/disclosure vector.
|
||
|
||
### ExternalSystemGateway-008 — Cancellation is conflated with transient timeout failure
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium — re-triaged: root cause already fixed in current source (see Resolution) |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ErrorClassifier.cs:24-30`, `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:157-159` |
|
||
|
||
**Description**
|
||
|
||
`ErrorClassifier.IsTransient(Exception)` returns `true` for `TaskCanceledException`
|
||
and `OperationCanceledException`. `HttpClient.SendAsync` throws `TaskCanceledException`
|
||
both when its internal timeout elapses *and* when the supplied `CancellationToken` is
|
||
cancelled (e.g. the Script Execution Actor is stopped, or the actor system is shutting
|
||
down). Because `InvokeHttpAsync`'s `catch` filter treats all of these as transient, a
|
||
caller-initiated cancellation during a `CachedCall` will be misclassified as a
|
||
transient failure and the message will be buffered for retry — work the caller
|
||
explicitly asked to abandon. For a `Call`, a shutdown-time cancellation is reported to
|
||
the script as a "Transient error" rather than an `OperationCanceledException`.
|
||
|
||
**Recommendation**
|
||
|
||
In `InvokeHttpAsync`, check `cancellationToken.IsCancellationRequested` first and
|
||
rethrow `OperationCanceledException` (or let it propagate) before applying transient
|
||
classification. Only treat a cancellation as a timeout when the supplied token is
|
||
*not* the one that was cancelled.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). **Re-triage:** the root cause described — a
|
||
caller-initiated cancellation being misclassified as a transient failure — is **no
|
||
longer present in the current source** and is not reproducible. `InvokeHttpAsync`
|
||
already wraps both `SendAsync` and the response-body `ReadAsStringAsync` in ordered
|
||
`catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)`
|
||
filters that rethrow the caller's cancellation *before* the
|
||
`catch (Exception ex) when (ErrorClassifier.IsTransient(ex))` branch is ever reached
|
||
(this was added alongside the `ExternalSystemGateway-002` timeout fix). A caller-cancel
|
||
therefore propagates as `OperationCanceledException` and is never buffered; only the
|
||
gateway's own timeout token reclassifies as transient.
|
||
|
||
`ErrorClassifier.IsTransient(Exception)` does still return `true` for
|
||
`TaskCanceledException`/`OperationCanceledException`, but that is **correct and
|
||
intentional**: a `TaskCanceledException` raised by an HTTP timeout *is* a genuine
|
||
transient failure, and the only caller (`InvokeHttpAsync:238`) is unreachable for a
|
||
caller-cancellation because the two preceding `when`-filtered catches intercept it
|
||
first. The transient-vs-cancel decision is contextual (which token fired) and cannot
|
||
be made from the exception type alone, which is exactly why the call site does it.
|
||
No source change was required. A regression guard,
|
||
`CachedCall_CallerCancellation_IsNotBufferedAsTransient`, was added: it cancels the
|
||
caller token mid-`CachedCall` and asserts an `OperationCanceledException` is thrown and
|
||
the S&F buffer remains empty (the cancelled work is not retried). The existing
|
||
`Call_CallerCancellation_IsNotMisreportedAsTimeout` covers the synchronous `Call` path.
|
||
|
||
### ExternalSystemGateway-009 — `StoreAndForwardResult` from `EnqueueAsync` is discarded; permanent failures during buffering are swallowed
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium — re-triaged: root cause subsumed by the ExternalSystemGateway-003 dispatch redesign (see Resolution) |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109-117` |
|
||
|
||
**Description**
|
||
|
||
`CachedCallAsync` assigns the result of `_storeAndForward.EnqueueAsync(...)` to
|
||
`sfResult` and then never reads it — it unconditionally returns
|
||
`new ExternalCallResult(true, null, null, WasBuffered: true)`. `EnqueueAsync` can
|
||
return `Success == false` (a permanent failure encountered during its
|
||
immediate-delivery attempt — `StoreAndForwardService.cs:142`) or `Buffered == false`
|
||
(delivered immediately). In both cases the ESG still reports the call as buffered and
|
||
successful to the script. A permanent failure surfaced by the S&F immediate attempt is
|
||
therefore silently lost instead of being returned to the script as the design requires
|
||
("On permanent failure (HTTP 4xx), the error is returned synchronously").
|
||
|
||
**Recommendation**
|
||
|
||
Inspect `sfResult`: if `Success == false` return an error `ExternalCallResult`; set
|
||
`WasBuffered` from `sfResult.Buffered` rather than hard-coding `true`. (This finding is
|
||
partly subsumed by the dispatch redesign in finding 003.)
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). **Re-triage:** the stated root cause — "a
|
||
permanent failure surfaced by `EnqueueAsync`'s immediate-delivery attempt is silently
|
||
lost" — **can no longer occur** in the current source, and the dead `sfResult` variable
|
||
the finding cites has already been removed. The `ExternalSystemGateway-003` fix changed
|
||
`CachedCallAsync` to call `EnqueueAsync` with `attemptImmediateDelivery: false`. With
|
||
that flag, `EnqueueAsync` never invokes the registered delivery handler: it skips the
|
||
immediate-delivery block entirely (so the `StoreAndForwardResult(false, …, …)`
|
||
permanent-failure return at `StoreAndForwardService.cs:147` is unreachable from this
|
||
caller) and unconditionally buffers, returning `Accepted: true, WasBuffered: true`
|
||
(`StoreAndForwardService.cs:180`). The `ExternalCallResult(true, null, null,
|
||
WasBuffered: true)` that `CachedCallAsync` returns is therefore now factually correct
|
||
in every reachable case — the message *is* buffered and there is no swallowed permanent
|
||
failure. Permanent HTTP 4xx failures are still surfaced synchronously, because
|
||
`CachedCallAsync` makes its own first HTTP attempt and catches
|
||
`PermanentExternalSystemException` *before* it ever reaches `EnqueueAsync`. No source
|
||
change was required beyond the `ExternalSystemGateway-003` redesign that already landed.
|
||
Coverage: `CachedCall_TransientFailure_BuffersWithSystemRetrySettings` asserts both
|
||
`result.WasBuffered == true` and that the message is genuinely present in the S&F buffer
|
||
(depth == 1), confirming the `WasBuffered: true` claim is not a lie; the existing
|
||
`CachedCall` permanent-failure path is exercised by `Call_Permanent400_ReturnsPermanentError`
|
||
semantics shared via `InvokeHttpAsync`.
|
||
|
||
### ExternalSystemGateway-010 — `GetConnectionAsync` leaks the `SqlConnection` when `OpenAsync` fails
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Performance & resource management |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:48-50` |
|
||
|
||
**Description**
|
||
|
||
`GetConnectionAsync` constructs `new SqlConnection(...)` and calls `await
|
||
connection.OpenAsync(...)`. If `OpenAsync` throws (unreachable server, bad
|
||
credentials, cancellation) the just-created `SqlConnection` instance is never disposed
|
||
— the exception propagates and the local reference is lost. While an unopened
|
||
`SqlConnection` is lightweight, over many failing calls this is an avoidable leak. The
|
||
design doc says `Database.Connection()` failures return an error to the script; the
|
||
current code lets a raw `SqlException` escape, which is acceptable, but the leak is
|
||
not.
|
||
|
||
**Recommendation**
|
||
|
||
Wrap the open in a try/catch that disposes the connection before rethrowing:
|
||
`try { await connection.OpenAsync(ct); } catch { connection.Dispose(); throw; }`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). `GetConnectionAsync` now wraps `OpenAsync` in a
|
||
`try/catch` that calls `await connection.DisposeAsync()` before rethrowing, so a failed
|
||
open (unreachable server, bad credentials, cancellation) no longer leaks the
|
||
`SqlConnection`. Connection creation was extracted into an `internal virtual
|
||
CreateConnection(string)` factory so the failure path is unit-testable. Regression test
|
||
`GetConnection_OpenFails_DisposesConnectionBeforeRethrowing` substitutes a `DbConnection`
|
||
whose `OpenAsync` always throws and asserts the connection is disposed when the
|
||
exception propagates; it was verified to fail before the `try/catch` was added.
|
||
|
||
### ExternalSystemGateway-011 — Every call performs a full repository scan of all systems and methods
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Performance & resource management |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:360-374`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:169-176` |
|
||
|
||
**Description**
|
||
|
||
`ResolveSystemAndMethodAsync` calls `GetAllExternalSystemsAsync()` and then
|
||
`GetMethodsByExternalSystemIdAsync()` and filters in memory on every single call;
|
||
`ResolveConnectionAsync` calls `GetAllDatabaseConnectionsAsync()` and filters in memory
|
||
on every cached write / connection request. At sites this hits the SQLite repository,
|
||
and `SiteExternalSystemRepository` re-reads and re-parses the methods JSON each time.
|
||
For a hot script path this is unnecessary repeated I/O and allocation. Definitions only
|
||
change on deployment, so they are eminently cacheable.
|
||
|
||
**Recommendation**
|
||
|
||
Add an in-memory cache of system/method/connection definitions keyed by name,
|
||
invalidated on artifact deployment. Alternatively use a name-keyed repository lookup
|
||
rather than fetch-all-then-filter.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `<pending>`). A cross-module change was explicitly
|
||
authorized, so the **name-keyed repository lookup** recommendation was applied — the
|
||
cleaner of the two options, and one that avoids the staleness hazard a
|
||
deployment-invalidated cache would introduce.
|
||
|
||
Three name-keyed methods were added to `IExternalSystemRepository`
|
||
(`ScadaLink.Commons`): `GetExternalSystemByNameAsync(name)`,
|
||
`GetMethodByNameAsync(externalSystemId, methodName)` and
|
||
`GetDatabaseConnectionByNameAsync(name)`. The connection lookup belongs on the same
|
||
interface because database connection definitions are already part of
|
||
`IExternalSystemRepository` (alongside `GetAllDatabaseConnectionsAsync` /
|
||
`GetDatabaseConnectionByIdAsync`), so the existing repository organization was
|
||
followed rather than introducing a new interface.
|
||
|
||
Both implementers of the interface were updated:
|
||
|
||
- `ScadaLink.ConfigurationDatabase.ExternalSystemRepository` — all three are genuine
|
||
server-side keyed queries (`FirstOrDefaultAsync(x => x.Name == name)`, the
|
||
method lookup additionally scoped by `ExternalSystemDefinitionId`), matching the
|
||
existing `GetMethodByNameAsync` / `GetListByNameAsync` / `GetSharedScriptByNameAsync`
|
||
convention in the other Central repositories.
|
||
- `ScadaLink.SiteRuntime.SiteExternalSystemRepository` — `GetExternalSystemByNameAsync`
|
||
and `GetDatabaseConnectionByNameAsync` are genuine single-row indexed SQLite queries
|
||
(`WHERE name = @name`; both tables have `name` as the PRIMARY KEY).
|
||
`GetMethodByNameAsync` resolves the named method from the parent system's
|
||
`method_definitions` JSON column; this still requires resolving the parent system
|
||
(one id→name scan), but the gateway's new call sequence performs **one** scan instead
|
||
of the previous **two** (`GetAllExternalSystemsAsync` + `GetMethodsByExternalSystemIdAsync`,
|
||
which itself scanned), so the site path is strictly better than before — noted as
|
||
the one place a residual scan remains, bounded by the deployed system count.
|
||
|
||
`ExternalSystemClient.ResolveSystemAndMethodAsync` and
|
||
`DatabaseGateway.ResolveConnectionAsync` were rewritten to call the keyed lookups; the
|
||
`GetAllExternalSystemsAsync` / `GetMethodsByExternalSystemIdAsync` /
|
||
`GetAllDatabaseConnectionsAsync` + in-memory `FirstOrDefault` filtering is gone from
|
||
both hot paths.
|
||
|
||
Regression tests (TDD — written first, verified failing/not-compiling before the
|
||
implementation, then confirmed green; one was additionally verified to fail when the
|
||
keyed query is deliberately broken):
|
||
|
||
- ConfigurationDatabase (`RepositoryCoverageTests`):
|
||
`GetExternalSystemByName_ReturnsMatchingRow`,
|
||
`GetExternalSystemByName_MissingName_ReturnsNull`,
|
||
`GetMethodByName_ReturnsMethodScopedToParentSystem`,
|
||
`GetMethodByName_MissingMethod_ReturnsNull`,
|
||
`GetDatabaseConnectionByName_ReturnsMatchingRow`,
|
||
`GetDatabaseConnectionByName_MissingName_ReturnsNull`.
|
||
- SiteRuntime (`SiteRepositoryTests`):
|
||
`ExternalSystemRepository_GetByName_ReturnsMatchingDefinition`,
|
||
`ExternalSystemRepository_GetByName_MissingName_ReturnsNull`,
|
||
`ExternalSystemRepository_GetMethodByName_ResolvesScopedToSystem`,
|
||
`DatabaseConnectionRepository_GetByName_ReturnsMatchingDefinition`.
|
||
- ExternalSystemGateway: the existing `ExternalSystemClientTests` /
|
||
`DatabaseGatewayTests` resolution stubs were migrated to the keyed methods (via
|
||
`StubResolution` / `StubConnection` helpers), so the full gateway suite now exercises
|
||
and protects the keyed-lookup resolution path.
|
||
|
||
`dotnet build ScadaLink.slnx` is clean; `ScadaLink.ExternalSystemGateway.Tests` (54),
|
||
`ScadaLink.ConfigurationDatabase.Tests` (106), `ScadaLink.SiteRuntime.Tests` (196) and
|
||
`ScadaLink.Commons.Tests` (226) all pass.
|
||
|
||
### ExternalSystemGateway-012 — Permanent-failure logging requirement is not met; `_logger` is injected but unused
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:24,169-177`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:22` |
|
||
|
||
**Description**
|
||
|
||
The design doc states permanent failures are "Logged to Site Event Logging", but
|
||
`InvokeHttpAsync` performs no logging on the permanent-failure path. In fact the
|
||
injected `ILogger<ExternalSystemClient>` and `ILogger<DatabaseGateway>` fields are
|
||
never used at all in either class. Either the logging is expected to happen in the
|
||
caller (Script Execution Actor) — in which case the design doc is imprecise about
|
||
where — or it is missing. Separately, `IsTransient(HttpStatusCode)` treats any
|
||
non-success, non-(5xx/408/429) status as permanent without an explicit comment, which
|
||
is a reasonable default but undocumented.
|
||
|
||
**Recommendation**
|
||
|
||
Add a `_logger.LogWarning` on the permanent-failure path (and a debug log on
|
||
transient), or clarify in the design doc that Site Event Logging capture is the
|
||
caller's responsibility and remove the unused `_logger` fields. Add a comment in
|
||
`ErrorClassifier` documenting the "default to permanent" behaviour.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Partial re-triage of the finding text: the
|
||
"`_logger` injected but unused" claim is **stale** — both loggers are already used (the
|
||
`DeliverBufferedAsync` retry-sweep handlers added by `ExternalSystemGateway-001` log
|
||
park/error events). The genuine remaining gap — `InvokeHttpAsync` performing no logging
|
||
on the HTTP-failure paths — is now fixed: `InvokeHttpAsync` emits a
|
||
`_logger.LogWarning` on the permanent-failure path (status code, system, method,
|
||
truncated error body) so a permanent failure is visible in Site Event Logging as the
|
||
design requires, and a `_logger.LogDebug` on the transient-failure path (transient
|
||
failures are normal retry/S&F operation and must not be noisy at warning level). A
|
||
documentation comment was added to `ErrorClassifier.IsTransient(HttpStatusCode)`
|
||
explaining the "any other non-success status defaults to permanent" behaviour and why
|
||
permanent is the safe default. Regression tests: `Call_PermanentFailure_LogsAWarning`
|
||
(asserts a warning carrying the system name is emitted; verified to fail before the
|
||
`LogWarning` was added) and `Call_TransientFailure_DoesNotLogAtWarningOrAbove` (guards
|
||
against over-logging transient failures).
|
||
|
||
### ExternalSystemGateway-013 — `MaxConcurrentConnectionsPerSystem` and `DefaultHttpTimeout` options are defined but never used
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Code organization & conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemGatewayOptions.cs:9,12`, `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13` |
|
||
|
||
**Description**
|
||
|
||
`ExternalSystemGatewayOptions.MaxConcurrentConnectionsPerSystem` (default 10) and
|
||
`DefaultHttpTimeout` (default 30s) are bound from configuration but neither is read
|
||
anywhere. `AddHttpClient()` registers the default factory with no
|
||
`ConfigurePrimaryHttpMessageHandler`/`SocketsHttpHandler` `MaxConnectionsPerServer` and
|
||
no `Timeout`, so both options have no effect. An operator setting these values gets
|
||
them silently ignored — a misleading configuration surface (`DefaultHttpTimeout` is
|
||
also referenced by finding 002).
|
||
|
||
**Recommendation**
|
||
|
||
Either wire the options into a named/typed `HttpClient` registration (set
|
||
`MaxConnectionsPerServer` on the primary handler, set `Timeout`), or remove the unused
|
||
options to avoid implying behaviour that does not exist.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Partial re-triage of the finding text:
|
||
`DefaultHttpTimeout` is **no longer unused** — it became the effective per-call HTTP
|
||
round-trip limit when `ExternalSystemGateway-002` was fixed (`InvokeHttpAsync` builds a
|
||
linked `CancellationTokenSource(DefaultHttpTimeout)`). The genuinely-unused option,
|
||
`MaxConcurrentConnectionsPerSystem`, is now wired in: `AddExternalSystemGateway` adds a
|
||
`ConfigureHttpClientDefaults` registration that supplies a `SocketsHttpHandler` whose
|
||
`MaxConnectionsPerServer` is bound from the option, so it applies to every per-system
|
||
named client (`ExternalSystem_{name}`) the gateway creates rather than being silently
|
||
ignored. Regression test
|
||
`ServiceWiringTests.MaxConcurrentConnectionsPerSystem_IsAppliedToTheNamedHttpClientPrimaryHandler`
|
||
builds the DI container, resolves the named client's handler chain via
|
||
`IHttpMessageHandlerFactory`, walks to the primary handler and asserts
|
||
`SocketsHttpHandler.MaxConnectionsPerServer` equals the configured value; it was verified
|
||
to fail before the wiring was added.
|
||
|
||
### ExternalSystemGateway-014 — Cached-call buffering path and `DatabaseGateway` are untested
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Testing coverage |
|
||
| Status | Resolved |
|
||
| Location | `tests/ScadaLink.ExternalSystemGateway.Tests/ExternalSystemClientTests.cs:1`, `tests/ScadaLink.ExternalSystemGateway.Tests/DatabaseGatewayTests.cs` |
|
||
|
||
**Description**
|
||
|
||
`ExternalSystemClientTests` covers system/method not-found, success, transient 500 and
|
||
permanent 400 for `CallAsync`, plus `CachedCall` not-found and success. It does **not**
|
||
cover: the `CachedCall` transient-failure → S&F buffering branch (the most
|
||
behaviour-rich path, including the `_storeAndForward == null` fallback and `WasBuffered`
|
||
semantics), the `CachedCall` permanent-failure branch, connection-exception
|
||
classification (`HttpRequestException` thrown by the handler), `BuildUrl` query-string
|
||
construction, and `ApplyAuth` for the apikey/basic variants. There is **no test file
|
||
for `DatabaseGateway`** at all — `GetConnectionAsync` not-found, `CachedWriteAsync`
|
||
not-found, and the `_storeAndForward == null` guard are entirely uncovered. The
|
||
`MockHttpMessageHandler` also does not assert request URL/headers/body, so auth and
|
||
URL construction are unverified.
|
||
|
||
**Recommendation**
|
||
|
||
Add tests for the `CachedCall` transient/buffering paths (with a substituted S&F
|
||
service), `DatabaseGateway` not-found and null-S&F guards, and `BuildUrl`/`ApplyAuth`
|
||
by asserting on the captured `HttpRequestMessage` in the mock handler.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Partial re-triage of the finding text: a number
|
||
of the listed gaps were **already closed** by tests added when findings 001–010 were
|
||
fixed — `DatabaseGatewayTests.cs` now exists (not-found, null-S&F guard, buffered-write
|
||
delivery), and the `CachedCall` transient/buffering and permanent paths are covered by
|
||
the `ExternalSystemGateway-001/003/004/008` regression tests. The remaining genuine
|
||
coverage gaps are now closed with new tests:
|
||
|
||
- `Call_GetWithParameters_AppendsEscapedQueryString` — `BuildUrl` GET query-string
|
||
construction, asserting on the captured request URI and that an `&` inside a value is
|
||
percent-escaped.
|
||
- `Call_PostWithParameters_SendsJsonBody` — POST JSON-body construction.
|
||
- `Call_ApiKeyAuthWithDefaultHeader_SendsXApiKeyHeader`,
|
||
`Call_ApiKeyAuthWithCustomHeader_SendsNamedHeader`,
|
||
`Call_BasicAuth_SendsBase64AuthorizationHeader` — `ApplyAuth` for all three auth
|
||
variants, asserting on the captured request headers.
|
||
- `Call_ConnectionError_IsClassifiedAsTransient` — a connection-level
|
||
`HttpRequestException` is classified transient.
|
||
- `CachedWrite_BuffersTheWriteWithConnectionRetrySettings` and
|
||
`CachedWrite_ZeroMaxRetriesIsHonouredNotTreatedAsUnset` — the `DatabaseGateway`
|
||
`CachedWrite` happy-path enqueue against a real S&F service.
|
||
|
||
The shared `RequestCapturingHandler` test helper was extended to capture request
|
||
headers and body so URL/auth/body construction is now verified, not just status codes.
|
||
These are new-coverage tests against already-correct behaviour, so they pass on the
|
||
current source; the `BuildUrl` and `ApplyAuth` paths they exercise are now protected
|
||
against regression.
|
||
|
||
### ExternalSystemGateway-015 — `MaxRetries == 0` is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:120-127`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:102-108` |
|
||
|
||
**Description**
|
||
|
||
The `ExternalSystemGateway-004` fix removed the `MaxRetries > 0 ? ... : null` guard so
|
||
that `CachedCallAsync` and `CachedWriteAsync` now pass the definition's `MaxRetries`
|
||
to `StoreAndForwardService.EnqueueAsync` verbatim. The stated rationale (and the inline
|
||
comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101`, plus the
|
||
tests `CachedCall_TransientFailure_ZeroMaxRetriesIsHonouredNotTreatedAsUnset` and
|
||
`CachedWrite_ZeroMaxRetriesIsHonouredNotTreatedAsUnset`) is that a configured
|
||
`MaxRetries` of `0` means **"never retry"**.
|
||
|
||
That is the opposite of what the Store-and-Forward engine actually does with the
|
||
value. `EnqueueAsync` stores the passed `maxRetries` directly into
|
||
`StoreAndForwardMessage.MaxRetries` (`StoreAndForwardService.cs:139`), whose own XML
|
||
doc states **"Maximum retry-sweep attempts before parking (0 = no limit)"**
|
||
(`StoreAndForwardMessage.cs:30`). The retry sweep enforces it as
|
||
`if (message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries)`
|
||
(`StoreAndForwardService.cs:285`) — when `MaxRetries == 0` that guard is false on
|
||
every sweep, so the message is **never parked and is retried forever**.
|
||
|
||
Consequences for a `CachedCall`/`CachedWrite` against a system or connection
|
||
configured with `MaxRetries = 0`:
|
||
|
||
1. A transiently-failing message that the operator intended never to retry is instead
|
||
retried on every sweep indefinitely, accumulating in the buffer and repeatedly
|
||
re-dispatching the request — the exact unbounded-retry / duplicate-delivery hazard
|
||
the idempotency note in the design doc warns about.
|
||
2. The two ESG regression tests cited above assert `max_retries == 0` is *stored* and
|
||
describe that as "honoured" — they verify the persisted column, never the resulting
|
||
sweep behaviour, so they lock in the broken semantics.
|
||
3. Because the SiteRuntime repository still always supplies `MaxRetries == 0` (the
|
||
open companion gap noted in `ExternalSystemGateway-004`), **every** cached call and
|
||
cached write at every site currently buffers as retry-forever. Before the ESG-004
|
||
fix the `> 0` guard sent `null`, so the S&F `DefaultMaxRetries` (a bounded value)
|
||
applied — i.e. the ESG-004 fix turned a bounded retry into an unbounded one for the
|
||
common path.
|
||
|
||
**Recommendation**
|
||
|
||
Reconcile the ESG and S&F interpretations of `MaxRetries == 0` — they must agree.
|
||
Either: (a) treat the entity's `MaxRetries == 0` as "unset" and pass `null` so the
|
||
bounded S&F default applies (reverting to the pre-ESG-004 behaviour, and accepting
|
||
that "never retry" then needs a different representation such as a nullable field or a
|
||
`MaxRetries == 1` convention); or (b) if `0` genuinely must mean "never retry", add an
|
||
explicit no-retry path — e.g. do not enqueue at all on transient failure when
|
||
`MaxRetries == 0`, or introduce a distinct sentinel — and fix the
|
||
`StoreAndForwardMessage.MaxRetries` doc and `RetryMessageAsync` guard so `0` no longer
|
||
means "no limit". Update the two `ZeroMaxRetries...` tests to assert the *sweep*
|
||
outcome (parked / not retried), not just the stored column value.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17. Root cause confirmed: the S&F engine treats a stored
|
||
`MaxRetries == 0` as "no limit / retry forever" (`StoreAndForwardMessage.MaxRetries`
|
||
doc "0 = no limit"; sweep guard `MaxRetries > 0 && RetryCount >= MaxRetries`), while
|
||
the entity's non-nullable `int MaxRetries` defaults to `0` — so passing it verbatim
|
||
buffered every cached call/write as an unbounded retry loop. Fix (ESG-side only,
|
||
recommendation (a)): `CachedCallAsync` and `CachedWriteAsync` now pass
|
||
`MaxRetries > 0 ? MaxRetries : null`, so an entity `0` is treated as "unset" and the
|
||
bounded S&F `DefaultMaxRetries` applies; the misleading "0 = never retry" inline
|
||
comments were corrected. The two `ZeroMaxRetries...` tests were rewritten to
|
||
`CachedCall_TransientFailure_ZeroMaxRetriesIsTreatedAsUnsetNotRetryForever` /
|
||
`CachedWrite_ZeroMaxRetriesIsTreatedAsUnsetNotRetryForever`, asserting the buffered
|
||
message carries the bounded default (99) and never `0`.
|
||
|
||
### ExternalSystemGateway-016 — `ConfigureHttpClientDefaults` applies the ESG connection cap to every `HttpClient` in the host process
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Code organization & conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:21-29` |
|
||
|
||
**Description**
|
||
|
||
The `ExternalSystemGateway-013` fix wires `MaxConcurrentConnectionsPerSystem` by
|
||
calling `services.ConfigureHttpClientDefaults(builder => builder.ConfigurePrimaryHttpMessageHandler(...))`.
|
||
The inline comment claims this "applies to the dynamically-named clients created by
|
||
`ExternalSystemClient`" — but `ConfigureHttpClientDefaults` is **process-global**: it
|
||
adds the configuration to *every* `HttpClient`/`IHttpClientFactory` client created
|
||
anywhere in the host, regardless of name.
|
||
|
||
The Host registers the External System Gateway alongside other components that also
|
||
use `IHttpClientFactory` — notably `ScadaLink.NotificationService` (`OAuth2TokenService`
|
||
and its `ServiceCollectionExtensions` call `AddHttpClient`). With the ESG registration
|
||
present, the OAuth2 token client (and any future `HttpClient` consumer in the host)
|
||
has its **primary handler replaced** by a `SocketsHttpHandler` whose
|
||
`MaxConnectionsPerServer` is the ESG's `MaxConcurrentConnectionsPerSystem`. That:
|
||
|
||
1. Silently caps unrelated clients at a value owned by, and named for, a different
|
||
component — an operator tuning the ESG option unknowingly throttles Microsoft 365
|
||
OAuth2 token acquisition.
|
||
2. Overrides/discards any primary-handler configuration those other components add for
|
||
their own clients (e.g. a custom handler, proxy, or certificate settings).
|
||
|
||
This is a leaky, surprising side effect for what the option claims to be a per-ESG
|
||
setting; `ConfigureHttpClientDefaults` should not be used to express a single
|
||
component's policy.
|
||
|
||
**Recommendation**
|
||
|
||
Scope the handler configuration to the gateway's own clients only. The ESG already
|
||
creates per-system clients with a deterministic name pattern
|
||
(`ExternalSystem_{system.Name}`); register a typed/named `HttpClient` (or a small
|
||
factory abstraction) for that pattern and call `ConfigurePrimaryHttpMessageHandler`
|
||
on that registration instead of on the global defaults. If a name-pattern registration
|
||
is impractical, document the global effect explicitly and rename the option, but the
|
||
preferred fix is to stop using `ConfigureHttpClientDefaults`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17. Root cause confirmed: `ConfigureHttpClientDefaults` is
|
||
process-global and replaced the primary handler of every `IHttpClientFactory` client
|
||
in the host, leaking the ESG connection cap onto unrelated clients. Fix: the global
|
||
`ConfigureHttpClientDefaults` registration was replaced with an
|
||
`IConfigureNamedOptions<HttpClientFactoryOptions>` (`GatewayHttpClientConfigurator`)
|
||
that applies the `SocketsHttpHandler`/`MaxConnectionsPerServer` cap only to clients
|
||
whose name starts with `ExternalSystem_` (the gateway's own per-system clients), so
|
||
clients owned by other components keep their own (or the framework default) primary
|
||
handler. Regression test
|
||
`ServiceWiringTests.MaxConcurrentConnectionsPerSystem_IsNotAppliedToNonGatewayHttpClients`
|
||
asserts a non-gateway client does not inherit the cap while the gateway client still
|
||
does; it was verified to fail before the fix.
|
||
|
||
### ExternalSystemGateway-017 — `BuildUrl` appends a bare trailing `?` when a GET method's parameters are all null
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:324-333` |
|
||
|
||
**Description**
|
||
|
||
In `BuildUrl`, the GET/DELETE query-string branch is entered when
|
||
`parameters != null && parameters.Count > 0`, but the projection then filters out
|
||
null-valued entries (`parameters.Where(p => p.Value != null)`). When a GET/DELETE
|
||
method is invoked with a non-empty parameter dictionary whose values are *all* null,
|
||
`queryString` is the empty string and the code still executes `url += "?" + queryString`,
|
||
producing a URL ending in a bare `?` (e.g. `https://host/api/resource?`). Most servers
|
||
tolerate a trailing `?`, but it is an unintended artifact, can defeat response caching
|
||
on some endpoints, and makes captured request URLs harder to read in logs.
|
||
|
||
**Recommendation**
|
||
|
||
Only append `"?" + queryString` when `queryString` is non-empty (compute the joined
|
||
string first and check it), so a method whose effective parameter set is empty
|
||
produces a clean URL identical to the no-parameters case.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17. Root cause confirmed: `BuildUrl` appended `"?" + queryString`
|
||
whenever the GET/DELETE parameter dictionary was non-empty, even when every value
|
||
was null and `queryString` was the empty string, yielding a bare trailing `?`. Fix:
|
||
`BuildUrl` now appends `"?" + queryString` only when `queryString.Length > 0`, so a
|
||
method whose effective parameter set is empty produces a URL identical to the
|
||
no-parameters case. Regression test
|
||
`Call_GetWithAllNullParameters_DoesNotAppendTrailingQuestionMark` asserts the
|
||
captured request URI has no trailing `?`; it was verified to fail before the fix.
|