fix(external-system-gateway): resolve ExternalSystemGateway-004..010 — honour retry settings, dispose HTTP messages, fix URL building, truncate error bodies, fix connection leak

This commit is contained in:
Joseph Doherty
2026-05-16 21:11:24 -04:00
parent 8c67ffad2a
commit 2502e4d10a
5 changed files with 615 additions and 52 deletions

View File

@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 11 |
| Open findings | 4 |
## Summary
@@ -215,7 +215,7 @@ is flipped back to `true`.
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:114-115`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:86-87` |
**Description**
@@ -242,7 +242,21 @@ should be tracked against the SiteRuntime module.
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending). `CachedCallAsync` and `CachedWriteAsync` now pass
the definition's `MaxRetries` to `EnqueueAsync` verbatim — the `> 0` guard is dropped, so
a legitimately-configured `MaxRetries` of 0 ("never retry") is honoured instead of being
collapsed to the S&F default. The `RetryDelay > TimeSpan.Zero` guard is deliberately
**kept**: `TimeSpan.Zero` is the entity default for an unconfigured field and a literal
zero-delay retry loop is not a valid configuration, so falling back to the S&F default
interval for an unset delay is correct (only `MaxRetries == 0` is a meaningful operator
choice). Regression test `CachedCall_TransientFailure_ZeroMaxRetriesIsHonouredNotTreatedAsUnset`
buffers a transient failure and asserts the buffered message carries `MaxRetries == 0`
rather than the S&F default; `CachedCall_TransientFailure_BuffersWithSystemRetrySettings`
additionally covers a non-default settings pass-through. The companion fix in
`SiteExternalSystemRepository.MapExternalSystem` to actually read the `MaxRetries` /
`RetryDelay` columns from SQLite remains a tracked follow-up against the SiteRuntime
module (outside this module's edit scope) — until then, sites still supply
`MaxRetries == 0`, which this fix now correctly honours as "never retry".
### ExternalSystemGateway-005 — `HttpRequestMessage` and `HttpResponseMessage` are not disposed
@@ -250,7 +264,7 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:133-167` |
**Description**
@@ -272,15 +286,22 @@ occurs on the exception paths.
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending). `InvokeHttpAsync` now declares the request as
`using var request` and wraps all response handling in a `using (response)` block, so
both `IDisposable` instances (and the request's `StringContent` / the response content
stream) are released on the success path **and** on the permanent/transient
exception paths. Regression tests `Call_SuccessfulHttp_DisposesRequestAndResponse` and
`Call_PermanentFailure_StillDisposesRequestAndResponse` use a disposal-tracking
`HttpMessageHandler`/`HttpContent` and assert both the request and the response content
are disposed; both were verified to fail before the `using` wrappers were added.
### ExternalSystemGateway-006 — `BuildUrl` ignores path templates and appends a trailing slash for empty paths
| | |
|--|--|
| Severity | Medium |
| Severity | Medium — partially re-triaged: trailing-slash bug fixed; path-templating sub-issue is a design decision (see Resolution) |
| Category | Correctness & logic bugs |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:180-196` |
**Description**
@@ -305,7 +326,25 @@ example and state that paths are literal. Also avoid appending a trailing `/` wh
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending). The **trailing-slash bug** is fixed: `BuildUrl`
now appends a `/`-joined path segment only when the method's path is non-empty after
trimming, so a method targeting the base URL itself produces `https://host/api` rather
than `https://host/api/`. Regression tests `Call_MethodWithEmptyPath_DoesNotAppendTrailingSlash`
and `Call_MethodWithPath_BuildsExpectedUrl` (asserting on the captured request URI)
cover the empty-path and normal-path cases; the empty-path test was verified to fail
before the fix.
Re-triage of the **path-templating sub-issue** (`{id}` placeholder substitution): this
is a genuine design decision, not a code bug, and it requires editing the component
design doc — both outside this module's edit scope (`src/`, `tests/`, this file only).
The current code treats method paths as literal strings and routes parameters to the
query string (GET/DELETE) or JSON body (POST/PUT); a method authored as `/recipes/{id}`
sends the `{id}` token verbatim. **Tracked follow-up / surfaced design question:** the
design owner must decide whether path templating is in scope — if yes, implement
`{name}` substitution in `BuildUrl` and exclude substituted params from the
query/body; if no, the `Component-ExternalSystemGateway.md` `/recipes/{id}` example must
be changed to a literal path. The trailing-slash defect (the concrete correctness bug
in this finding) is fully resolved.
### ExternalSystemGateway-007 — External error response bodies are echoed verbatim into script-visible error messages
@@ -313,7 +352,7 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:167-177` |
**Description**
@@ -335,15 +374,25 @@ the script. Optionally only include the body when the content type is textual.
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending). `InvokeHttpAsync` now truncates the external error
response body to `MaxErrorBodyChars` (2048) via a `Truncate` helper before embedding it
into the transient/permanent exception message — so a misbehaving or hostile endpoint
can no longer inflate every script-visible `ErrorMessage` and Site Event Logging entry
with a multi-megabyte body. When truncation occurs the message is suffixed with
`… [truncated, N chars total]` so the original size is still visible. Regression test
`Call_PermanentFailureWithHugeErrorBody_TruncatesErrorMessage` drives a 400 with a
500 KB body and asserts the resulting `ErrorMessage` is bounded (< 4096 chars); it was
verified to fail (500 040-char message) before the cap was added. Content-type
filtering was considered optional in the recommendation and was not implemented — the
size cap alone closes the inflation/disclosure vector.
### ExternalSystemGateway-008 — Cancellation is conflated with transient timeout failure
| | |
|--|--|
| Severity | Medium |
| Severity | Medium — re-triaged: root cause already fixed in current source (see Resolution) |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.ExternalSystemGateway/ErrorClassifier.cs:24-30`, `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:157-159` |
**Description**
@@ -367,15 +416,37 @@ classification. Only treat a cancellation as a timeout when the supplied token i
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending). **Re-triage:** the root cause described — a
caller-initiated cancellation being misclassified as a transient failure — is **no
longer present in the current source** and is not reproducible. `InvokeHttpAsync`
already wraps both `SendAsync` and the response-body `ReadAsStringAsync` in ordered
`catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)`
filters that rethrow the caller's cancellation *before* the
`catch (Exception ex) when (ErrorClassifier.IsTransient(ex))` branch is ever reached
(this was added alongside the `ExternalSystemGateway-002` timeout fix). A caller-cancel
therefore propagates as `OperationCanceledException` and is never buffered; only the
gateway's own timeout token reclassifies as transient.
`ErrorClassifier.IsTransient(Exception)` does still return `true` for
`TaskCanceledException`/`OperationCanceledException`, but that is **correct and
intentional**: a `TaskCanceledException` raised by an HTTP timeout *is* a genuine
transient failure, and the only caller (`InvokeHttpAsync:238`) is unreachable for a
caller-cancellation because the two preceding `when`-filtered catches intercept it
first. The transient-vs-cancel decision is contextual (which token fired) and cannot
be made from the exception type alone, which is exactly why the call site does it.
No source change was required. A regression guard,
`CachedCall_CallerCancellation_IsNotBufferedAsTransient`, was added: it cancels the
caller token mid-`CachedCall` and asserts an `OperationCanceledException` is thrown and
the S&F buffer remains empty (the cancelled work is not retried). The existing
`Call_CallerCancellation_IsNotMisreportedAsTimeout` covers the synchronous `Call` path.
### ExternalSystemGateway-009 — `StoreAndForwardResult` from `EnqueueAsync` is discarded; permanent failures during buffering are swallowed
| | |
|--|--|
| Severity | Medium |
| Severity | Medium — re-triaged: root cause subsumed by the ExternalSystemGateway-003 dispatch redesign (see Resolution) |
| Category | Correctness & logic bugs |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109-117` |
**Description**
@@ -398,7 +469,27 @@ partly subsumed by the dispatch redesign in finding 003.)
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending). **Re-triage:** the stated root cause — "a
permanent failure surfaced by `EnqueueAsync`'s immediate-delivery attempt is silently
lost" — **can no longer occur** in the current source, and the dead `sfResult` variable
the finding cites has already been removed. The `ExternalSystemGateway-003` fix changed
`CachedCallAsync` to call `EnqueueAsync` with `attemptImmediateDelivery: false`. With
that flag, `EnqueueAsync` never invokes the registered delivery handler: it skips the
immediate-delivery block entirely (so the `StoreAndForwardResult(false, …, …)`
permanent-failure return at `StoreAndForwardService.cs:147` is unreachable from this
caller) and unconditionally buffers, returning `Accepted: true, WasBuffered: true`
(`StoreAndForwardService.cs:180`). The `ExternalCallResult(true, null, null,
WasBuffered: true)` that `CachedCallAsync` returns is therefore now factually correct
in every reachable case — the message *is* buffered and there is no swallowed permanent
failure. Permanent HTTP 4xx failures are still surfaced synchronously, because
`CachedCallAsync` makes its own first HTTP attempt and catches
`PermanentExternalSystemException` *before* it ever reaches `EnqueueAsync`. No source
change was required beyond the `ExternalSystemGateway-003` redesign that already landed.
Coverage: `CachedCall_TransientFailure_BuffersWithSystemRetrySettings` asserts both
`result.WasBuffered == true` and that the message is genuinely present in the S&F buffer
(depth == 1), confirming the `WasBuffered: true` claim is not a lie; the existing
`CachedCall` permanent-failure path is exercised by `Call_Permanent400_ReturnsPermanentError`
semantics shared via `InvokeHttpAsync`.
### ExternalSystemGateway-010 — `GetConnectionAsync` leaks the `SqlConnection` when `OpenAsync` fails
@@ -406,7 +497,7 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:48-50` |
**Description**
@@ -427,7 +518,14 @@ Wrap the open in a try/catch that disposes the connection before rethrowing:
**Resolution**
_Unresolved._
Resolved 2026-05-16 (commit pending). `GetConnectionAsync` now wraps `OpenAsync` in a
`try/catch` that calls `await connection.DisposeAsync()` before rethrowing, so a failed
open (unreachable server, bad credentials, cancellation) no longer leaks the
`SqlConnection`. Connection creation was extracted into an `internal virtual
CreateConnection(string)` factory so the failure path is unit-testable. Regression test
`GetConnection_OpenFails_DisposesConnectionBeforeRethrowing` substitutes a `DbConnection`
whose `OpenAsync` always throws and asserts the connection is disposed when the
exception propagates; it was verified to fail before the `try/catch` was added.
### ExternalSystemGateway-011 — Every call performs a full repository scan of all systems and methods