722 lines
40 KiB
Markdown
722 lines
40 KiB
Markdown
# Code Review — ExternalSystemGateway
|
||
|
||
| Field | Value |
|
||
|-------|-------|
|
||
| Module | `src/ScadaLink.ExternalSystemGateway` |
|
||
| Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
|
||
| Status | Reviewed |
|
||
| Last reviewed | 2026-05-16 |
|
||
| Reviewer | claude-agent |
|
||
| Commit reviewed | `9c60592` |
|
||
| Open findings | 1 |
|
||
|
||
## Summary
|
||
|
||
The External System Gateway is a small module (five source files plus options) that
|
||
implements the HTTP/REST client (`ExternalSystemClient`), the database access surface
|
||
(`DatabaseGateway`), and error classification (`ErrorClassifier`). The structure is
|
||
clean and the dual call-mode semantics broadly match the design doc. However, the
|
||
review surfaced several substantive problems that prevent the module from behaving as
|
||
designed. The most serious is that **no store-and-forward delivery handler is ever
|
||
registered** for the `ExternalSystem` or `CachedDbWrite` categories, so cached calls
|
||
and cached writes are buffered but can never actually be delivered on retry — a silent
|
||
data-loss path. Two further high-impact issues are that the **per-system call timeout
|
||
is never applied** to the HTTP client (the design's central error-handling guarantee
|
||
is absent), and that **`CachedCall` double-dispatches the HTTP request** because
|
||
`StoreAndForwardService.EnqueueAsync` itself re-attempts immediate delivery, breaking
|
||
the idempotency expectations. A cluster of medium issues concern resource leaks,
|
||
classification gaps (cancellation conflation), and the dropped `StoreAndForwardResult`.
|
||
Test coverage is thin — `CachedCall` transient/buffering paths and `DatabaseGateway`
|
||
are entirely untested. Themes: incomplete wiring against the S&F engine, and design-doc
|
||
requirements (timeout, retry settings) that are declared but not implemented.
|
||
|
||
## Checklist coverage
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | ☑ | URL building edge cases, dropped S&F result, classification gaps — findings 003, 006, 009. |
|
||
| 2 | Akka.NET conventions | ☑ | No actors in this module; `AddExternalSystemGatewayActors` is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found in this module. |
|
||
| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; `ExternalCallResult.Response` lazy-parse is not thread-safe but instances are single-use. No findings raised. |
|
||
| 4 | Error handling & resilience | ☑ | S&F handler never registered, double-dispatch, timeout not applied, cancellation conflation — findings 001, 002, 003, 008. |
|
||
| 5 | Security | ☑ | Auth secrets logged-safe, but error bodies echoed verbatim — finding 007. |
|
||
| 6 | Performance & resource management | ☑ | `HttpRequestMessage`/`HttpResponseMessage` and failed `SqlConnection` not disposed; full repository scan per call — findings 005, 010, 011. |
|
||
| 7 | Design-document adherence | ☑ | Timeout, retry settings, audit logging gaps — findings 002, 004, 012. |
|
||
| 8 | Code organization & conventions | ☑ | Options class correctly owned by module; `MaxConcurrentConnectionsPerSystem` unused — finding 013. |
|
||
| 9 | Testing coverage | ☑ | CachedCall buffering and DatabaseGateway untested — finding 014. |
|
||
| 10 | Documentation & comments | ☑ | XML docs reference WP numbers; permanent-failure logging requirement unverified — folded into finding 012. |
|
||
|
||
## Findings
|
||
|
||
### ExternalSystemGateway-001 — No S&F delivery handler registered; cached calls and writes can never be delivered
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Critical |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:81` |
|
||
|
||
**Description**
|
||
|
||
`CachedCallAsync` and `CachedWriteAsync` enqueue messages under
|
||
`StoreAndForwardCategory.ExternalSystem` and `StoreAndForwardCategory.CachedDbWrite`.
|
||
`StoreAndForwardService.RegisterDeliveryHandler` is the only mechanism that lets the
|
||
S&F engine actually deliver a buffered message, and a repository-wide search shows it
|
||
is **never called for either category** anywhere in the codebase. Consequences:
|
||
|
||
1. On a transient failure, `EnqueueAsync` falls through to the "No handler registered
|
||
— buffer for later" branch (`StoreAndForwardService.cs:163`) and the message is
|
||
persisted.
|
||
2. During the retry sweep, `AttemptDeliveryAsync` (`StoreAndForwardService.cs:201`)
|
||
logs `"No delivery handler for category {Category}"` and returns without ever
|
||
removing or delivering the message.
|
||
|
||
The result is that every cached external call and cached DB write is silently
|
||
buffered forever and never delivered — a data-loss path for the exact "deferred
|
||
delivery is acceptable" use cases the design doc calls out (posting production data,
|
||
quality reports). The script also receives `WasBuffered: true` / a successful
|
||
`CachedWriteAsync` completion, so the failure is completely invisible.
|
||
|
||
**Recommendation**
|
||
|
||
Register delivery handlers for `StoreAndForwardCategory.ExternalSystem` and
|
||
`StoreAndForwardCategory.CachedDbWrite` during host/site startup. The `ExternalSystem`
|
||
handler should deserialize the payload, re-resolve the system/method, and re-invoke
|
||
`InvokeHttpAsync`, returning `true`/`false`/throwing per the transient-vs-permanent
|
||
contract `EnqueueAsync` expects. The `CachedDbWrite` handler should execute the SQL
|
||
against the named connection. Add an integration test that buffers a message and
|
||
verifies it is delivered by a retry sweep.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16. Delivery handlers for `StoreAndForwardCategory.ExternalSystem` and
|
||
`CachedDbWrite` are now registered at site startup in `AkkaHostedService`, after
|
||
`StoreAndForwardService.StartAsync()`. Each handler resolves its consumer in a fresh DI
|
||
scope and calls a new `DeliverBufferedAsync`: `ExternalSystemClient.DeliverBufferedAsync`
|
||
re-resolves the system/method and re-invokes `InvokeHttpAsync`, and
|
||
`DatabaseGateway.DeliverBufferedAsync` executes the buffered SQL — each returning `true`
|
||
on success, `false` (park) when the target no longer exists or fails permanently, and
|
||
throwing on transient failure so the engine retries. `EnqueueAsync` gained an
|
||
`attemptImmediateDelivery` parameter; `CachedCallAsync` passes `false` so registering the
|
||
handler does not dispatch the request twice (the double-dispatch noted in
|
||
`ExternalSystemGateway-003`). Regression tests cover the success, target-removed and
|
||
transient-retry paths. Fixed by the commit whose message references
|
||
`ExternalSystemGateway-001`.
|
||
|
||
### ExternalSystemGateway-002 — Per-system call timeout is never applied to HTTP requests
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:130`, `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13` |
|
||
|
||
**Description**
|
||
|
||
The design doc states each external system definition specifies a timeout that
|
||
"applies to all method calls on that system" and "applies to the HTTP request
|
||
round-trip", and `ExternalSystemGatewayOptions.DefaultHttpTimeout` exists as a
|
||
fallback. In practice no timeout is ever configured. `ServiceCollectionExtensions`
|
||
calls `services.AddHttpClient()` with no per-named-client configuration, and
|
||
`InvokeHttpAsync` calls `_httpClientFactory.CreateClient($"ExternalSystem_{system.Name}")`
|
||
without setting `client.Timeout` or passing a `CancellationToken` derived from a
|
||
timeout. `SendAsync` is therefore subject only to `HttpClient`'s default 100-second
|
||
timeout, regardless of the system definition or the configured `DefaultHttpTimeout`.
|
||
A slow or hung external system will block the calling Script Execution Actor far
|
||
longer than the operator configured, and the design's core error-handling guarantee
|
||
(timeout → transient classification) does not hold within the intended window.
|
||
|
||
There is also no `Timeout` field on `ExternalSystemDefinition` at all, so even a
|
||
correct implementation has nowhere to read the per-system value from — the entity is
|
||
missing the field the design requires.
|
||
|
||
**Recommendation**
|
||
|
||
Add a `Timeout` (TimeSpan) field to `ExternalSystemDefinition` and have
|
||
`InvokeHttpAsync` enforce it — either by setting `client.Timeout` via a typed/named
|
||
`HttpClient` registration, or by linking a `CancellationTokenSource` with the
|
||
per-system (or `DefaultHttpTimeout`) timeout to the supplied `cancellationToken`
|
||
before `SendAsync`. Ensure the resulting `TaskCanceledException`/`TimeoutException`
|
||
is classified as transient.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `<pending>`). `InvokeHttpAsync` now enforces a call
|
||
timeout: `ExternalSystemClient` takes an `IOptions<ExternalSystemGatewayOptions>` and
|
||
links a `CancellationTokenSource(DefaultHttpTimeout)` with the caller's token before
|
||
`SendAsync` and the response-body read, so the design's "timeout applies to the HTTP
|
||
request round-trip" guarantee now holds within the configured window (default 30s)
|
||
instead of `HttpClient`'s default 100s. A timeout is reclassified as a
|
||
`TransientExternalSystemException`; a caller-initiated cancellation is distinguished
|
||
from a timeout and propagated as `OperationCanceledException` rather than being
|
||
swallowed as transient. Regression tests:
|
||
`Call_SlowSystem_TimesOutAsTransientErrorWithinConfiguredWindow` and
|
||
`Call_CallerCancellation_IsNotMisreportedAsTimeout`.
|
||
|
||
Note (partial scope): the per-*system* `Timeout` field on `ExternalSystemDefinition`
|
||
remains unimplemented — adding it requires a change to `ScadaLink.Commons`, which is
|
||
outside this module's edit scope. Until that entity field exists, the configured
|
||
`DefaultHttpTimeout` is the effective per-call limit for every system. A follow-up
|
||
against the Commons module should add the `Timeout` field and have `InvokeHttpAsync`
|
||
prefer it over the default. This is a tracked follow-up, not a regression.
|
||
|
||
### ExternalSystemGateway-003 — `CachedCall` double-dispatches the HTTP request
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:84-117` |
|
||
|
||
**Description**
|
||
|
||
`CachedCallAsync` first calls `InvokeHttpAsync` directly (line 86). On a
|
||
`TransientExternalSystemException` it then calls `_storeAndForward.EnqueueAsync(...)`
|
||
(line 109). `StoreAndForwardService.EnqueueAsync` is **not** a pure enqueue — it
|
||
"Attempts immediate delivery" by invoking the registered delivery handler
|
||
(`StoreAndForwardService.cs:128-159`). If a delivery handler for the `ExternalSystem`
|
||
category is registered (as finding 001 recommends), the HTTP request will be executed
|
||
a **second time** synchronously inside `EnqueueAsync`, immediately after the first
|
||
attempt failed. For a transient failure that is actually a slow/overloaded system,
|
||
this doubles the load and — critically — if the original request did reach the
|
||
external system, the immediate retry produces a duplicate delivery before the script
|
||
even returns, worsening the idempotency hazard the design doc explicitly warns about.
|
||
|
||
**Recommendation**
|
||
|
||
Decide on one dispatch path. Either (a) have `CachedCall` not pre-invoke
|
||
`InvokeHttpAsync` and instead let `EnqueueAsync`'s immediate-delivery attempt be the
|
||
single first attempt (requires the handler to exist and to surface permanent vs
|
||
transient correctly); or (b) add an enqueue-only entry point to
|
||
`StoreAndForwardService` that skips the immediate-delivery attempt, and have
|
||
`CachedCall` use it after its own first attempt. Approach (a) is cleaner and removes
|
||
the duplicated logic.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `<pending>`). Re-triage: this finding was already fixed in
|
||
the codebase as a side effect of the `ExternalSystemGateway-001` fix and is no longer
|
||
reproducible against the current source. `StoreAndForwardService.EnqueueAsync` gained an
|
||
`attemptImmediateDelivery` parameter (recommendation approach (b)), and
|
||
`CachedCallAsync` passes `attemptImmediateDelivery: false` after its own first HTTP
|
||
attempt — so `EnqueueAsync` buffers the message for the background retry sweep without
|
||
re-invoking the registered delivery handler, eliminating the duplicate dispatch. A
|
||
dedicated regression test, `CachedCall_TransientFailure_DoesNotImmediatelyRedispatchViaRegisteredHandler`,
|
||
was added in this module's test suite: it registers a counting delivery handler, drives
|
||
a `CachedCall` whose HTTP attempt fails transiently, and asserts the handler is invoked
|
||
zero times during enqueue. The test was verified to fail if `attemptImmediateDelivery`
|
||
is flipped back to `true`.
|
||
|
||
### ExternalSystemGateway-004 — System retry settings are not honoured for cached calls/writes
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:114-115`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:86-87` |
|
||
|
||
**Description**
|
||
|
||
`CachedCallAsync` and `CachedWriteAsync` pass the definition's `MaxRetries` /
|
||
`RetryDelay` to `EnqueueAsync` only when they are non-default
|
||
(`MaxRetries > 0 ? ... : null`, `RetryDelay > TimeSpan.Zero ? ... : null`), otherwise
|
||
falling back to the S&F defaults. The site-side repository that supplies these
|
||
definitions, `SiteExternalSystemRepository.MapExternalSystem`
|
||
(`src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs:194`), never
|
||
reads `MaxRetries`/`RetryDelay` from SQLite at all — the constructed entities always
|
||
have `MaxRetries == 0` and `RetryDelay == TimeSpan.Zero`. As a result, at sites the
|
||
per-system retry settings the design doc requires are *always* discarded and the
|
||
global S&F defaults are silently used instead. The `> 0` guard in the ESG also makes
|
||
a legitimately-configured `MaxRetries` of 0 ("never retry") indistinguishable from
|
||
"unset", so an operator cannot express "do not retry".
|
||
|
||
**Recommendation**
|
||
|
||
Within this module, drop the `> 0` / `> Zero` guards and pass the definition values
|
||
through directly (or use nullable fields on the entity to distinguish "unset"). The
|
||
companion fix in `SiteExternalSystemRepository` to actually map the retry columns
|
||
should be tracked against the SiteRuntime module.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). `CachedCallAsync` and `CachedWriteAsync` now pass
|
||
the definition's `MaxRetries` to `EnqueueAsync` verbatim — the `> 0` guard is dropped, so
|
||
a legitimately-configured `MaxRetries` of 0 ("never retry") is honoured instead of being
|
||
collapsed to the S&F default. The `RetryDelay > TimeSpan.Zero` guard is deliberately
|
||
**kept**: `TimeSpan.Zero` is the entity default for an unconfigured field and a literal
|
||
zero-delay retry loop is not a valid configuration, so falling back to the S&F default
|
||
interval for an unset delay is correct (only `MaxRetries == 0` is a meaningful operator
|
||
choice). Regression test `CachedCall_TransientFailure_ZeroMaxRetriesIsHonouredNotTreatedAsUnset`
|
||
buffers a transient failure and asserts the buffered message carries `MaxRetries == 0`
|
||
rather than the S&F default; `CachedCall_TransientFailure_BuffersWithSystemRetrySettings`
|
||
additionally covers a non-default settings pass-through. The companion fix in
|
||
`SiteExternalSystemRepository.MapExternalSystem` to actually read the `MaxRetries` /
|
||
`RetryDelay` columns from SQLite remains a tracked follow-up against the SiteRuntime
|
||
module (outside this module's edit scope) — until then, sites still supply
|
||
`MaxRetries == 0`, which this fix now correctly honours as "never retry".
|
||
|
||
### ExternalSystemGateway-005 — `HttpRequestMessage` and `HttpResponseMessage` are not disposed
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Performance & resource management |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:133-167` |
|
||
|
||
**Description**
|
||
|
||
`InvokeHttpAsync` creates an `HttpRequestMessage` (line 133) and receives an
|
||
`HttpResponseMessage` from `SendAsync` (line 155); neither is wrapped in a `using` nor
|
||
explicitly disposed. Both are `IDisposable` and own resources (the request's
|
||
`StringContent`, the response's content stream). Under the per-invocation call volume
|
||
of a busy site this produces avoidable pressure on the finalizer queue and can hold
|
||
socket/stream resources longer than necessary. The success path reads the content but
|
||
never disposes the response; the error path likewise reads `errorBody` and then throws
|
||
without disposing.
|
||
|
||
**Recommendation**
|
||
|
||
Wrap the request in `using var request = ...` and the response in
|
||
`using var response = ...` (or call `Dispose()` in a `finally`). Ensure disposal still
|
||
occurs on the exception paths.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). `InvokeHttpAsync` now declares the request as
|
||
`using var request` and wraps all response handling in a `using (response)` block, so
|
||
both `IDisposable` instances (and the request's `StringContent` / the response content
|
||
stream) are released on the success path **and** on the permanent/transient
|
||
exception paths. Regression tests `Call_SuccessfulHttp_DisposesRequestAndResponse` and
|
||
`Call_PermanentFailure_StillDisposesRequestAndResponse` use a disposal-tracking
|
||
`HttpMessageHandler`/`HttpContent` and assert both the request and the response content
|
||
are disposed; both were verified to fail before the `using` wrappers were added.
|
||
|
||
### ExternalSystemGateway-006 — `BuildUrl` ignores path templates and appends a trailing slash for empty paths
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium — partially re-triaged: trailing-slash bug fixed; path-templating sub-issue is a design decision (see Resolution) |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:180-196` |
|
||
|
||
**Description**
|
||
|
||
`BuildUrl` does `baseUrl.TrimEnd('/') + "/" + path.TrimStart('/')`. When `method.Path`
|
||
is empty (a method that targets the base URL itself), this still appends a `/`,
|
||
producing `https://host/api/` which some servers treat as a different resource than
|
||
`https://host/api`. More importantly, the design doc shows method paths as templates
|
||
like `/recipes/{id}`, but `BuildUrl` performs no placeholder substitution — a `{id}`
|
||
token is sent literally in the URL and the corresponding parameter is instead appended
|
||
as a query-string entry (for GET/DELETE) or placed in the JSON body (POST/PUT). Either
|
||
the design's path-template feature is unimplemented, or the doc is stale; in the
|
||
current code a method defined as `/recipes/{id}` will never produce a correct URL.
|
||
|
||
**Recommendation**
|
||
|
||
Decide whether path templating is in scope. If yes, implement `{name}` substitution
|
||
from `parameters` in `BuildUrl` and exclude substituted parameters from the query
|
||
string/body. If no, update the component design doc to remove the `/recipes/{id}`
|
||
example and state that paths are literal. Also avoid appending a trailing `/` when
|
||
`path` is empty.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). The **trailing-slash bug** is fixed: `BuildUrl`
|
||
now appends a `/`-joined path segment only when the method's path is non-empty after
|
||
trimming, so a method targeting the base URL itself produces `https://host/api` rather
|
||
than `https://host/api/`. Regression tests `Call_MethodWithEmptyPath_DoesNotAppendTrailingSlash`
|
||
and `Call_MethodWithPath_BuildsExpectedUrl` (asserting on the captured request URI)
|
||
cover the empty-path and normal-path cases; the empty-path test was verified to fail
|
||
before the fix.
|
||
|
||
Re-triage of the **path-templating sub-issue** (`{id}` placeholder substitution): this
|
||
is a genuine design decision, not a code bug, and it requires editing the component
|
||
design doc — both outside this module's edit scope (`src/`, `tests/`, this file only).
|
||
The current code treats method paths as literal strings and routes parameters to the
|
||
query string (GET/DELETE) or JSON body (POST/PUT); a method authored as `/recipes/{id}`
|
||
sends the `{id}` token verbatim. **Tracked follow-up / surfaced design question:** the
|
||
design owner must decide whether path templating is in scope — if yes, implement
|
||
`{name}` substitution in `BuildUrl` and exclude substituted params from the
|
||
query/body; if no, the `Component-ExternalSystemGateway.md` `/recipes/{id}` example must
|
||
be changed to a literal path. The trailing-slash defect (the concrete correctness bug
|
||
in this finding) is fully resolved.
|
||
|
||
### ExternalSystemGateway-007 — External error response bodies are echoed verbatim into script-visible error messages
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Security |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:167-177` |
|
||
|
||
**Description**
|
||
|
||
On a non-success HTTP response, the full response body is read into `errorBody` and
|
||
embedded verbatim into the exception message (`$"HTTP {code} from {name}: {errorBody}"`),
|
||
which then flows into `ExternalCallResult.ErrorMessage` and back to the calling script,
|
||
and into Site Event Logging. An external system error page can be arbitrarily large
|
||
(an HTML stack trace, a multi-megabyte body) and may contain sensitive detail. There
|
||
is no size cap, so a hostile or misbehaving endpoint can inflate every error log entry
|
||
and error string returned to scripts. There is also no content-type check before
|
||
treating the body as text.
|
||
|
||
**Recommendation**
|
||
|
||
Truncate `errorBody` to a bounded length (e.g. 1–2 KB) before embedding it, and
|
||
consider logging the full body separately at debug level rather than returning it to
|
||
the script. Optionally only include the body when the content type is textual.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). `InvokeHttpAsync` now truncates the external error
|
||
response body to `MaxErrorBodyChars` (2048) via a `Truncate` helper before embedding it
|
||
into the transient/permanent exception message — so a misbehaving or hostile endpoint
|
||
can no longer inflate every script-visible `ErrorMessage` and Site Event Logging entry
|
||
with a multi-megabyte body. When truncation occurs the message is suffixed with
|
||
`… [truncated, N chars total]` so the original size is still visible. Regression test
|
||
`Call_PermanentFailureWithHugeErrorBody_TruncatesErrorMessage` drives a 400 with a
|
||
500 KB body and asserts the resulting `ErrorMessage` is bounded (< 4096 chars); it was
|
||
verified to fail (500 040-char message) before the cap was added. Content-type
|
||
filtering was considered optional in the recommendation and was not implemented — the
|
||
size cap alone closes the inflation/disclosure vector.
|
||
|
||
### ExternalSystemGateway-008 — Cancellation is conflated with transient timeout failure
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium — re-triaged: root cause already fixed in current source (see Resolution) |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ErrorClassifier.cs:24-30`, `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:157-159` |
|
||
|
||
**Description**
|
||
|
||
`ErrorClassifier.IsTransient(Exception)` returns `true` for `TaskCanceledException`
|
||
and `OperationCanceledException`. `HttpClient.SendAsync` throws `TaskCanceledException`
|
||
both when its internal timeout elapses *and* when the supplied `CancellationToken` is
|
||
cancelled (e.g. the Script Execution Actor is stopped, or the actor system is shutting
|
||
down). Because `InvokeHttpAsync`'s `catch` filter treats all of these as transient, a
|
||
caller-initiated cancellation during a `CachedCall` will be misclassified as a
|
||
transient failure and the message will be buffered for retry — work the caller
|
||
explicitly asked to abandon. For a `Call`, a shutdown-time cancellation is reported to
|
||
the script as a "Transient error" rather than an `OperationCanceledException`.
|
||
|
||
**Recommendation**
|
||
|
||
In `InvokeHttpAsync`, check `cancellationToken.IsCancellationRequested` first and
|
||
rethrow `OperationCanceledException` (or let it propagate) before applying transient
|
||
classification. Only treat a cancellation as a timeout when the supplied token is
|
||
*not* the one that was cancelled.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). **Re-triage:** the root cause described — a
|
||
caller-initiated cancellation being misclassified as a transient failure — is **no
|
||
longer present in the current source** and is not reproducible. `InvokeHttpAsync`
|
||
already wraps both `SendAsync` and the response-body `ReadAsStringAsync` in ordered
|
||
`catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)`
|
||
filters that rethrow the caller's cancellation *before* the
|
||
`catch (Exception ex) when (ErrorClassifier.IsTransient(ex))` branch is ever reached
|
||
(this was added alongside the `ExternalSystemGateway-002` timeout fix). A caller-cancel
|
||
therefore propagates as `OperationCanceledException` and is never buffered; only the
|
||
gateway's own timeout token reclassifies as transient.
|
||
|
||
`ErrorClassifier.IsTransient(Exception)` does still return `true` for
|
||
`TaskCanceledException`/`OperationCanceledException`, but that is **correct and
|
||
intentional**: a `TaskCanceledException` raised by an HTTP timeout *is* a genuine
|
||
transient failure, and the only caller (`InvokeHttpAsync:238`) is unreachable for a
|
||
caller-cancellation because the two preceding `when`-filtered catches intercept it
|
||
first. The transient-vs-cancel decision is contextual (which token fired) and cannot
|
||
be made from the exception type alone, which is exactly why the call site does it.
|
||
No source change was required. A regression guard,
|
||
`CachedCall_CallerCancellation_IsNotBufferedAsTransient`, was added: it cancels the
|
||
caller token mid-`CachedCall` and asserts an `OperationCanceledException` is thrown and
|
||
the S&F buffer remains empty (the cancelled work is not retried). The existing
|
||
`Call_CallerCancellation_IsNotMisreportedAsTimeout` covers the synchronous `Call` path.
|
||
|
||
### ExternalSystemGateway-009 — `StoreAndForwardResult` from `EnqueueAsync` is discarded; permanent failures during buffering are swallowed
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium — re-triaged: root cause subsumed by the ExternalSystemGateway-003 dispatch redesign (see Resolution) |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109-117` |
|
||
|
||
**Description**
|
||
|
||
`CachedCallAsync` assigns the result of `_storeAndForward.EnqueueAsync(...)` to
|
||
`sfResult` and then never reads it — it unconditionally returns
|
||
`new ExternalCallResult(true, null, null, WasBuffered: true)`. `EnqueueAsync` can
|
||
return `Success == false` (a permanent failure encountered during its
|
||
immediate-delivery attempt — `StoreAndForwardService.cs:142`) or `Buffered == false`
|
||
(delivered immediately). In both cases the ESG still reports the call as buffered and
|
||
successful to the script. A permanent failure surfaced by the S&F immediate attempt is
|
||
therefore silently lost instead of being returned to the script as the design requires
|
||
("On permanent failure (HTTP 4xx), the error is returned synchronously").
|
||
|
||
**Recommendation**
|
||
|
||
Inspect `sfResult`: if `Success == false` return an error `ExternalCallResult`; set
|
||
`WasBuffered` from `sfResult.Buffered` rather than hard-coding `true`. (This finding is
|
||
partly subsumed by the dispatch redesign in finding 003.)
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). **Re-triage:** the stated root cause — "a
|
||
permanent failure surfaced by `EnqueueAsync`'s immediate-delivery attempt is silently
|
||
lost" — **can no longer occur** in the current source, and the dead `sfResult` variable
|
||
the finding cites has already been removed. The `ExternalSystemGateway-003` fix changed
|
||
`CachedCallAsync` to call `EnqueueAsync` with `attemptImmediateDelivery: false`. With
|
||
that flag, `EnqueueAsync` never invokes the registered delivery handler: it skips the
|
||
immediate-delivery block entirely (so the `StoreAndForwardResult(false, …, …)`
|
||
permanent-failure return at `StoreAndForwardService.cs:147` is unreachable from this
|
||
caller) and unconditionally buffers, returning `Accepted: true, WasBuffered: true`
|
||
(`StoreAndForwardService.cs:180`). The `ExternalCallResult(true, null, null,
|
||
WasBuffered: true)` that `CachedCallAsync` returns is therefore now factually correct
|
||
in every reachable case — the message *is* buffered and there is no swallowed permanent
|
||
failure. Permanent HTTP 4xx failures are still surfaced synchronously, because
|
||
`CachedCallAsync` makes its own first HTTP attempt and catches
|
||
`PermanentExternalSystemException` *before* it ever reaches `EnqueueAsync`. No source
|
||
change was required beyond the `ExternalSystemGateway-003` redesign that already landed.
|
||
Coverage: `CachedCall_TransientFailure_BuffersWithSystemRetrySettings` asserts both
|
||
`result.WasBuffered == true` and that the message is genuinely present in the S&F buffer
|
||
(depth == 1), confirming the `WasBuffered: true` claim is not a lie; the existing
|
||
`CachedCall` permanent-failure path is exercised by `Call_Permanent400_ReturnsPermanentError`
|
||
semantics shared via `InvokeHttpAsync`.
|
||
|
||
### ExternalSystemGateway-010 — `GetConnectionAsync` leaks the `SqlConnection` when `OpenAsync` fails
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Performance & resource management |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:48-50` |
|
||
|
||
**Description**
|
||
|
||
`GetConnectionAsync` constructs `new SqlConnection(...)` and calls `await
|
||
connection.OpenAsync(...)`. If `OpenAsync` throws (unreachable server, bad
|
||
credentials, cancellation) the just-created `SqlConnection` instance is never disposed
|
||
— the exception propagates and the local reference is lost. While an unopened
|
||
`SqlConnection` is lightweight, over many failing calls this is an avoidable leak. The
|
||
design doc says `Database.Connection()` failures return an error to the script; the
|
||
current code lets a raw `SqlException` escape, which is acceptable, but the leak is
|
||
not.
|
||
|
||
**Recommendation**
|
||
|
||
Wrap the open in a try/catch that disposes the connection before rethrowing:
|
||
`try { await connection.OpenAsync(ct); } catch { connection.Dispose(); throw; }`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). `GetConnectionAsync` now wraps `OpenAsync` in a
|
||
`try/catch` that calls `await connection.DisposeAsync()` before rethrowing, so a failed
|
||
open (unreachable server, bad credentials, cancellation) no longer leaks the
|
||
`SqlConnection`. Connection creation was extracted into an `internal virtual
|
||
CreateConnection(string)` factory so the failure path is unit-testable. Regression test
|
||
`GetConnection_OpenFails_DisposesConnectionBeforeRethrowing` substitutes a `DbConnection`
|
||
whose `OpenAsync` always throws and asserts the connection is disposed when the
|
||
exception propagates; it was verified to fail before the `try/catch` was added.
|
||
|
||
### ExternalSystemGateway-011 — Every call performs a full repository scan of all systems and methods
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Performance & resource management |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:360-374`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:169-176` |
|
||
|
||
**Description**
|
||
|
||
`ResolveSystemAndMethodAsync` calls `GetAllExternalSystemsAsync()` and then
|
||
`GetMethodsByExternalSystemIdAsync()` and filters in memory on every single call;
|
||
`ResolveConnectionAsync` calls `GetAllDatabaseConnectionsAsync()` and filters in memory
|
||
on every cached write / connection request. At sites this hits the SQLite repository,
|
||
and `SiteExternalSystemRepository` re-reads and re-parses the methods JSON each time.
|
||
For a hot script path this is unnecessary repeated I/O and allocation. Definitions only
|
||
change on deployment, so they are eminently cacheable.
|
||
|
||
**Recommendation**
|
||
|
||
Add an in-memory cache of system/method/connection definitions keyed by name,
|
||
invalidated on artifact deployment. Alternatively use a name-keyed repository lookup
|
||
rather than fetch-all-then-filter.
|
||
|
||
**Resolution**
|
||
|
||
2026-05-16 — **Root cause confirmed, but left Open: no correct fix is possible within
|
||
this module's edit scope.** `ResolveSystemAndMethodAsync`
|
||
(`ExternalSystemClient.cs:360`) does call `GetAllExternalSystemsAsync()` followed by
|
||
`GetMethodsByExternalSystemIdAsync()` and filters in memory, and
|
||
`ResolveConnectionAsync` (`DatabaseGateway.cs:169`) does `GetAllDatabaseConnectionsAsync()`
|
||
then filters — fetch-all-then-filter on every hot-path call, as described.
|
||
|
||
Both recommended fixes require changes outside `src/ScadaLink.ExternalSystemGateway`:
|
||
(a) a **name-keyed repository lookup** (e.g. `GetExternalSystemByNameAsync`) means adding
|
||
methods to `IExternalSystemRepository` in `ScadaLink.Commons` and implementing them in
|
||
`ScadaLink.ConfigurationDatabase` / `ScadaLink.SiteRuntime`; (b) an **in-memory cache
|
||
invalidated on artifact deployment** requires subscribing to a deployment-applied event
|
||
owned by `ScadaLink.SiteRuntime` / `ScadaLink.DeploymentManager`. A purely module-local
|
||
cache with a time-based TTL was rejected as a fix: definitions only change on deployment
|
||
and must reflect a deployment promptly, so a TTL would either be too short to help the
|
||
hot path or long enough to serve stale definitions after a redeploy — trading a
|
||
correctness hazard for a performance gain on a Low-severity issue. **Tracked follow-up:**
|
||
add a name-keyed lookup to `IExternalSystemRepository` (Commons) and have the gateway use
|
||
it, or add a deployment-invalidated definition cache wired from SiteRuntime. No source
|
||
change was made in this module.
|
||
|
||
### ExternalSystemGateway-012 — Permanent-failure logging requirement is not met; `_logger` is injected but unused
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:24,169-177`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:22` |
|
||
|
||
**Description**
|
||
|
||
The design doc states permanent failures are "Logged to Site Event Logging", but
|
||
`InvokeHttpAsync` performs no logging on the permanent-failure path. In fact the
|
||
injected `ILogger<ExternalSystemClient>` and `ILogger<DatabaseGateway>` fields are
|
||
never used at all in either class. Either the logging is expected to happen in the
|
||
caller (Script Execution Actor) — in which case the design doc is imprecise about
|
||
where — or it is missing. Separately, `IsTransient(HttpStatusCode)` treats any
|
||
non-success, non-(5xx/408/429) status as permanent without an explicit comment, which
|
||
is a reasonable default but undocumented.
|
||
|
||
**Recommendation**
|
||
|
||
Add a `_logger.LogWarning` on the permanent-failure path (and a debug log on
|
||
transient), or clarify in the design doc that Site Event Logging capture is the
|
||
caller's responsibility and remove the unused `_logger` fields. Add a comment in
|
||
`ErrorClassifier` documenting the "default to permanent" behaviour.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Partial re-triage of the finding text: the
|
||
"`_logger` injected but unused" claim is **stale** — both loggers are already used (the
|
||
`DeliverBufferedAsync` retry-sweep handlers added by `ExternalSystemGateway-001` log
|
||
park/error events). The genuine remaining gap — `InvokeHttpAsync` performing no logging
|
||
on the HTTP-failure paths — is now fixed: `InvokeHttpAsync` emits a
|
||
`_logger.LogWarning` on the permanent-failure path (status code, system, method,
|
||
truncated error body) so a permanent failure is visible in Site Event Logging as the
|
||
design requires, and a `_logger.LogDebug` on the transient-failure path (transient
|
||
failures are normal retry/S&F operation and must not be noisy at warning level). A
|
||
documentation comment was added to `ErrorClassifier.IsTransient(HttpStatusCode)`
|
||
explaining the "any other non-success status defaults to permanent" behaviour and why
|
||
permanent is the safe default. Regression tests: `Call_PermanentFailure_LogsAWarning`
|
||
(asserts a warning carrying the system name is emitted; verified to fail before the
|
||
`LogWarning` was added) and `Call_TransientFailure_DoesNotLogAtWarningOrAbove` (guards
|
||
against over-logging transient failures).
|
||
|
||
### ExternalSystemGateway-013 — `MaxConcurrentConnectionsPerSystem` and `DefaultHttpTimeout` options are defined but never used
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Code organization & conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemGatewayOptions.cs:9,12`, `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13` |
|
||
|
||
**Description**
|
||
|
||
`ExternalSystemGatewayOptions.MaxConcurrentConnectionsPerSystem` (default 10) and
|
||
`DefaultHttpTimeout` (default 30s) are bound from configuration but neither is read
|
||
anywhere. `AddHttpClient()` registers the default factory with no
|
||
`ConfigurePrimaryHttpMessageHandler`/`SocketsHttpHandler` `MaxConnectionsPerServer` and
|
||
no `Timeout`, so both options have no effect. An operator setting these values gets
|
||
them silently ignored — a misleading configuration surface (`DefaultHttpTimeout` is
|
||
also referenced by finding 002).
|
||
|
||
**Recommendation**
|
||
|
||
Either wire the options into a named/typed `HttpClient` registration (set
|
||
`MaxConnectionsPerServer` on the primary handler, set `Timeout`), or remove the unused
|
||
options to avoid implying behaviour that does not exist.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Partial re-triage of the finding text:
|
||
`DefaultHttpTimeout` is **no longer unused** — it became the effective per-call HTTP
|
||
round-trip limit when `ExternalSystemGateway-002` was fixed (`InvokeHttpAsync` builds a
|
||
linked `CancellationTokenSource(DefaultHttpTimeout)`). The genuinely-unused option,
|
||
`MaxConcurrentConnectionsPerSystem`, is now wired in: `AddExternalSystemGateway` adds a
|
||
`ConfigureHttpClientDefaults` registration that supplies a `SocketsHttpHandler` whose
|
||
`MaxConnectionsPerServer` is bound from the option, so it applies to every per-system
|
||
named client (`ExternalSystem_{name}`) the gateway creates rather than being silently
|
||
ignored. Regression test
|
||
`ServiceWiringTests.MaxConcurrentConnectionsPerSystem_IsAppliedToTheNamedHttpClientPrimaryHandler`
|
||
builds the DI container, resolves the named client's handler chain via
|
||
`IHttpMessageHandlerFactory`, walks to the primary handler and asserts
|
||
`SocketsHttpHandler.MaxConnectionsPerServer` equals the configured value; it was verified
|
||
to fail before the wiring was added.
|
||
|
||
### ExternalSystemGateway-014 — Cached-call buffering path and `DatabaseGateway` are untested
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Testing coverage |
|
||
| Status | Resolved |
|
||
| Location | `tests/ScadaLink.ExternalSystemGateway.Tests/ExternalSystemClientTests.cs:1`, `tests/ScadaLink.ExternalSystemGateway.Tests/DatabaseGatewayTests.cs` |
|
||
|
||
**Description**
|
||
|
||
`ExternalSystemClientTests` covers system/method not-found, success, transient 500 and
|
||
permanent 400 for `CallAsync`, plus `CachedCall` not-found and success. It does **not**
|
||
cover: the `CachedCall` transient-failure → S&F buffering branch (the most
|
||
behaviour-rich path, including the `_storeAndForward == null` fallback and `WasBuffered`
|
||
semantics), the `CachedCall` permanent-failure branch, connection-exception
|
||
classification (`HttpRequestException` thrown by the handler), `BuildUrl` query-string
|
||
construction, and `ApplyAuth` for the apikey/basic variants. There is **no test file
|
||
for `DatabaseGateway`** at all — `GetConnectionAsync` not-found, `CachedWriteAsync`
|
||
not-found, and the `_storeAndForward == null` guard are entirely uncovered. The
|
||
`MockHttpMessageHandler` also does not assert request URL/headers/body, so auth and
|
||
URL construction are unverified.
|
||
|
||
**Recommendation**
|
||
|
||
Add tests for the `CachedCall` transient/buffering paths (with a substituted S&F
|
||
service), `DatabaseGateway` not-found and null-S&F guards, and `BuildUrl`/`ApplyAuth`
|
||
by asserting on the captured `HttpRequestMessage` in the mock handler.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Partial re-triage of the finding text: a number
|
||
of the listed gaps were **already closed** by tests added when findings 001–010 were
|
||
fixed — `DatabaseGatewayTests.cs` now exists (not-found, null-S&F guard, buffered-write
|
||
delivery), and the `CachedCall` transient/buffering and permanent paths are covered by
|
||
the `ExternalSystemGateway-001/003/004/008` regression tests. The remaining genuine
|
||
coverage gaps are now closed with new tests:
|
||
|
||
- `Call_GetWithParameters_AppendsEscapedQueryString` — `BuildUrl` GET query-string
|
||
construction, asserting on the captured request URI and that an `&` inside a value is
|
||
percent-escaped.
|
||
- `Call_PostWithParameters_SendsJsonBody` — POST JSON-body construction.
|
||
- `Call_ApiKeyAuthWithDefaultHeader_SendsXApiKeyHeader`,
|
||
`Call_ApiKeyAuthWithCustomHeader_SendsNamedHeader`,
|
||
`Call_BasicAuth_SendsBase64AuthorizationHeader` — `ApplyAuth` for all three auth
|
||
variants, asserting on the captured request headers.
|
||
- `Call_ConnectionError_IsClassifiedAsTransient` — a connection-level
|
||
`HttpRequestException` is classified transient.
|
||
- `CachedWrite_BuffersTheWriteWithConnectionRetrySettings` and
|
||
`CachedWrite_ZeroMaxRetriesIsHonouredNotTreatedAsUnset` — the `DatabaseGateway`
|
||
`CachedWrite` happy-path enqueue against a real S&F service.
|
||
|
||
The shared `RequestCapturingHandler` test helper was extended to capture request
|
||
headers and body so URL/auth/body construction is now verified, not just status codes.
|
||
These are new-coverage tests against already-correct behaviour, so they pass on the
|
||
current source; the `BuildUrl` and `ApplyAuth` paths they exercise are now protected
|
||
against regression.
|