Files
scadalink-design/code-reviews/ExternalSystemGateway/findings.md

40 KiB
Raw Blame History

Code Review — ExternalSystemGateway

Field Value
Module src/ScadaLink.ExternalSystemGateway
Design doc docs/requirements/Component-ExternalSystemGateway.md
Status Reviewed
Last reviewed 2026-05-16
Reviewer claude-agent
Commit reviewed 9c60592
Open findings 1

Summary

The External System Gateway is a small module (five source files plus options) that implements the HTTP/REST client (ExternalSystemClient), the database access surface (DatabaseGateway), and error classification (ErrorClassifier). The structure is clean and the dual call-mode semantics broadly match the design doc. However, the review surfaced several substantive problems that prevent the module from behaving as designed. The most serious is that no store-and-forward delivery handler is ever registered for the ExternalSystem or CachedDbWrite categories, so cached calls and cached writes are buffered but can never actually be delivered on retry — a silent data-loss path. Two further high-impact issues are that the per-system call timeout is never applied to the HTTP client (the design's central error-handling guarantee is absent), and that CachedCall double-dispatches the HTTP request because StoreAndForwardService.EnqueueAsync itself re-attempts immediate delivery, breaking the idempotency expectations. A cluster of medium issues concern resource leaks, classification gaps (cancellation conflation), and the dropped StoreAndForwardResult. Test coverage is thin — CachedCall transient/buffering paths and DatabaseGateway are entirely untested. Themes: incomplete wiring against the S&F engine, and design-doc requirements (timeout, retry settings) that are declared but not implemented.

Checklist coverage

# Category Examined Notes
1 Correctness & logic bugs URL building edge cases, dropped S&F result, classification gaps — findings 003, 006, 009.
2 Akka.NET conventions No actors in this module; AddExternalSystemGatewayActors is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found in this module.
3 Concurrency & thread safety Services are stateless and DI-scoped; ExternalCallResult.Response lazy-parse is not thread-safe but instances are single-use. No findings raised.
4 Error handling & resilience S&F handler never registered, double-dispatch, timeout not applied, cancellation conflation — findings 001, 002, 003, 008.
5 Security Auth secrets logged-safe, but error bodies echoed verbatim — finding 007.
6 Performance & resource management HttpRequestMessage/HttpResponseMessage and failed SqlConnection not disposed; full repository scan per call — findings 005, 010, 011.
7 Design-document adherence Timeout, retry settings, audit logging gaps — findings 002, 004, 012.
8 Code organization & conventions Options class correctly owned by module; MaxConcurrentConnectionsPerSystem unused — finding 013.
9 Testing coverage CachedCall buffering and DatabaseGateway untested — finding 014.
10 Documentation & comments XML docs reference WP numbers; permanent-failure logging requirement unverified — folded into finding 012.

Findings

ExternalSystemGateway-001 — No S&F delivery handler registered; cached calls and writes can never be delivered

Severity Critical
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:81

Description

CachedCallAsync and CachedWriteAsync enqueue messages under StoreAndForwardCategory.ExternalSystem and StoreAndForwardCategory.CachedDbWrite. StoreAndForwardService.RegisterDeliveryHandler is the only mechanism that lets the S&F engine actually deliver a buffered message, and a repository-wide search shows it is never called for either category anywhere in the codebase. Consequences:

  1. On a transient failure, EnqueueAsync falls through to the "No handler registered — buffer for later" branch (StoreAndForwardService.cs:163) and the message is persisted.
  2. During the retry sweep, AttemptDeliveryAsync (StoreAndForwardService.cs:201) logs "No delivery handler for category {Category}" and returns without ever removing or delivering the message.

The result is that every cached external call and cached DB write is silently buffered forever and never delivered — a data-loss path for the exact "deferred delivery is acceptable" use cases the design doc calls out (posting production data, quality reports). The script also receives WasBuffered: true / a successful CachedWriteAsync completion, so the failure is completely invisible.

Recommendation

Register delivery handlers for StoreAndForwardCategory.ExternalSystem and StoreAndForwardCategory.CachedDbWrite during host/site startup. The ExternalSystem handler should deserialize the payload, re-resolve the system/method, and re-invoke InvokeHttpAsync, returning true/false/throwing per the transient-vs-permanent contract EnqueueAsync expects. The CachedDbWrite handler should execute the SQL against the named connection. Add an integration test that buffers a message and verifies it is delivered by a retry sweep.

Resolution

Resolved 2026-05-16. Delivery handlers for StoreAndForwardCategory.ExternalSystem and CachedDbWrite are now registered at site startup in AkkaHostedService, after StoreAndForwardService.StartAsync(). Each handler resolves its consumer in a fresh DI scope and calls a new DeliverBufferedAsync: ExternalSystemClient.DeliverBufferedAsync re-resolves the system/method and re-invokes InvokeHttpAsync, and DatabaseGateway.DeliverBufferedAsync executes the buffered SQL — each returning true on success, false (park) when the target no longer exists or fails permanently, and throwing on transient failure so the engine retries. EnqueueAsync gained an attemptImmediateDelivery parameter; CachedCallAsync passes false so registering the handler does not dispatch the request twice (the double-dispatch noted in ExternalSystemGateway-003). Regression tests cover the success, target-removed and transient-retry paths. Fixed by the commit whose message references ExternalSystemGateway-001.

ExternalSystemGateway-002 — Per-system call timeout is never applied to HTTP requests

Severity High
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:130, src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13

Description

The design doc states each external system definition specifies a timeout that "applies to all method calls on that system" and "applies to the HTTP request round-trip", and ExternalSystemGatewayOptions.DefaultHttpTimeout exists as a fallback. In practice no timeout is ever configured. ServiceCollectionExtensions calls services.AddHttpClient() with no per-named-client configuration, and InvokeHttpAsync calls _httpClientFactory.CreateClient($"ExternalSystem_{system.Name}") without setting client.Timeout or passing a CancellationToken derived from a timeout. SendAsync is therefore subject only to HttpClient's default 100-second timeout, regardless of the system definition or the configured DefaultHttpTimeout. A slow or hung external system will block the calling Script Execution Actor far longer than the operator configured, and the design's core error-handling guarantee (timeout → transient classification) does not hold within the intended window.

There is also no Timeout field on ExternalSystemDefinition at all, so even a correct implementation has nowhere to read the per-system value from — the entity is missing the field the design requires.

Recommendation

Add a Timeout (TimeSpan) field to ExternalSystemDefinition and have InvokeHttpAsync enforce it — either by setting client.Timeout via a typed/named HttpClient registration, or by linking a CancellationTokenSource with the per-system (or DefaultHttpTimeout) timeout to the supplied cancellationToken before SendAsync. Ensure the resulting TaskCanceledException/TimeoutException is classified as transient.

Resolution

Resolved 2026-05-16 (commit <pending>). InvokeHttpAsync now enforces a call timeout: ExternalSystemClient takes an IOptions<ExternalSystemGatewayOptions> and links a CancellationTokenSource(DefaultHttpTimeout) with the caller's token before SendAsync and the response-body read, so the design's "timeout applies to the HTTP request round-trip" guarantee now holds within the configured window (default 30s) instead of HttpClient's default 100s. A timeout is reclassified as a TransientExternalSystemException; a caller-initiated cancellation is distinguished from a timeout and propagated as OperationCanceledException rather than being swallowed as transient. Regression tests: Call_SlowSystem_TimesOutAsTransientErrorWithinConfiguredWindow and Call_CallerCancellation_IsNotMisreportedAsTimeout.

Note (partial scope): the per-system Timeout field on ExternalSystemDefinition remains unimplemented — adding it requires a change to ScadaLink.Commons, which is outside this module's edit scope. Until that entity field exists, the configured DefaultHttpTimeout is the effective per-call limit for every system. A follow-up against the Commons module should add the Timeout field and have InvokeHttpAsync prefer it over the default. This is a tracked follow-up, not a regression.

ExternalSystemGateway-003 — CachedCall double-dispatches the HTTP request

Severity High
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:84-117

Description

CachedCallAsync first calls InvokeHttpAsync directly (line 86). On a TransientExternalSystemException it then calls _storeAndForward.EnqueueAsync(...) (line 109). StoreAndForwardService.EnqueueAsync is not a pure enqueue — it "Attempts immediate delivery" by invoking the registered delivery handler (StoreAndForwardService.cs:128-159). If a delivery handler for the ExternalSystem category is registered (as finding 001 recommends), the HTTP request will be executed a second time synchronously inside EnqueueAsync, immediately after the first attempt failed. For a transient failure that is actually a slow/overloaded system, this doubles the load and — critically — if the original request did reach the external system, the immediate retry produces a duplicate delivery before the script even returns, worsening the idempotency hazard the design doc explicitly warns about.

Recommendation

Decide on one dispatch path. Either (a) have CachedCall not pre-invoke InvokeHttpAsync and instead let EnqueueAsync's immediate-delivery attempt be the single first attempt (requires the handler to exist and to surface permanent vs transient correctly); or (b) add an enqueue-only entry point to StoreAndForwardService that skips the immediate-delivery attempt, and have CachedCall use it after its own first attempt. Approach (a) is cleaner and removes the duplicated logic.

Resolution

Resolved 2026-05-16 (commit <pending>). Re-triage: this finding was already fixed in the codebase as a side effect of the ExternalSystemGateway-001 fix and is no longer reproducible against the current source. StoreAndForwardService.EnqueueAsync gained an attemptImmediateDelivery parameter (recommendation approach (b)), and CachedCallAsync passes attemptImmediateDelivery: false after its own first HTTP attempt — so EnqueueAsync buffers the message for the background retry sweep without re-invoking the registered delivery handler, eliminating the duplicate dispatch. A dedicated regression test, CachedCall_TransientFailure_DoesNotImmediatelyRedispatchViaRegisteredHandler, was added in this module's test suite: it registers a counting delivery handler, drives a CachedCall whose HTTP attempt fails transiently, and asserts the handler is invoked zero times during enqueue. The test was verified to fail if attemptImmediateDelivery is flipped back to true.

ExternalSystemGateway-004 — System retry settings are not honoured for cached calls/writes

Severity Medium
Category Design-document adherence
Status Resolved
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:114-115, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:86-87

Description

CachedCallAsync and CachedWriteAsync pass the definition's MaxRetries / RetryDelay to EnqueueAsync only when they are non-default (MaxRetries > 0 ? ... : null, RetryDelay > TimeSpan.Zero ? ... : null), otherwise falling back to the S&F defaults. The site-side repository that supplies these definitions, SiteExternalSystemRepository.MapExternalSystem (src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs:194), never reads MaxRetries/RetryDelay from SQLite at all — the constructed entities always have MaxRetries == 0 and RetryDelay == TimeSpan.Zero. As a result, at sites the per-system retry settings the design doc requires are always discarded and the global S&F defaults are silently used instead. The > 0 guard in the ESG also makes a legitimately-configured MaxRetries of 0 ("never retry") indistinguishable from "unset", so an operator cannot express "do not retry".

Recommendation

Within this module, drop the > 0 / > Zero guards and pass the definition values through directly (or use nullable fields on the entity to distinguish "unset"). The companion fix in SiteExternalSystemRepository to actually map the retry columns should be tracked against the SiteRuntime module.

Resolution

Resolved 2026-05-16 (commit pending). CachedCallAsync and CachedWriteAsync now pass the definition's MaxRetries to EnqueueAsync verbatim — the > 0 guard is dropped, so a legitimately-configured MaxRetries of 0 ("never retry") is honoured instead of being collapsed to the S&F default. The RetryDelay > TimeSpan.Zero guard is deliberately kept: TimeSpan.Zero is the entity default for an unconfigured field and a literal zero-delay retry loop is not a valid configuration, so falling back to the S&F default interval for an unset delay is correct (only MaxRetries == 0 is a meaningful operator choice). Regression test CachedCall_TransientFailure_ZeroMaxRetriesIsHonouredNotTreatedAsUnset buffers a transient failure and asserts the buffered message carries MaxRetries == 0 rather than the S&F default; CachedCall_TransientFailure_BuffersWithSystemRetrySettings additionally covers a non-default settings pass-through. The companion fix in SiteExternalSystemRepository.MapExternalSystem to actually read the MaxRetries / RetryDelay columns from SQLite remains a tracked follow-up against the SiteRuntime module (outside this module's edit scope) — until then, sites still supply MaxRetries == 0, which this fix now correctly honours as "never retry".

ExternalSystemGateway-005 — HttpRequestMessage and HttpResponseMessage are not disposed

Severity Medium
Category Performance & resource management
Status Resolved
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:133-167

Description

InvokeHttpAsync creates an HttpRequestMessage (line 133) and receives an HttpResponseMessage from SendAsync (line 155); neither is wrapped in a using nor explicitly disposed. Both are IDisposable and own resources (the request's StringContent, the response's content stream). Under the per-invocation call volume of a busy site this produces avoidable pressure on the finalizer queue and can hold socket/stream resources longer than necessary. The success path reads the content but never disposes the response; the error path likewise reads errorBody and then throws without disposing.

Recommendation

Wrap the request in using var request = ... and the response in using var response = ... (or call Dispose() in a finally). Ensure disposal still occurs on the exception paths.

Resolution

Resolved 2026-05-16 (commit pending). InvokeHttpAsync now declares the request as using var request and wraps all response handling in a using (response) block, so both IDisposable instances (and the request's StringContent / the response content stream) are released on the success path and on the permanent/transient exception paths. Regression tests Call_SuccessfulHttp_DisposesRequestAndResponse and Call_PermanentFailure_StillDisposesRequestAndResponse use a disposal-tracking HttpMessageHandler/HttpContent and assert both the request and the response content are disposed; both were verified to fail before the using wrappers were added.

ExternalSystemGateway-006 — BuildUrl ignores path templates and appends a trailing slash for empty paths

Severity Medium — partially re-triaged: trailing-slash bug fixed; path-templating sub-issue is a design decision (see Resolution)
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:180-196

Description

BuildUrl does baseUrl.TrimEnd('/') + "/" + path.TrimStart('/'). When method.Path is empty (a method that targets the base URL itself), this still appends a /, producing https://host/api/ which some servers treat as a different resource than https://host/api. More importantly, the design doc shows method paths as templates like /recipes/{id}, but BuildUrl performs no placeholder substitution — a {id} token is sent literally in the URL and the corresponding parameter is instead appended as a query-string entry (for GET/DELETE) or placed in the JSON body (POST/PUT). Either the design's path-template feature is unimplemented, or the doc is stale; in the current code a method defined as /recipes/{id} will never produce a correct URL.

Recommendation

Decide whether path templating is in scope. If yes, implement {name} substitution from parameters in BuildUrl and exclude substituted parameters from the query string/body. If no, update the component design doc to remove the /recipes/{id} example and state that paths are literal. Also avoid appending a trailing / when path is empty.

Resolution

Resolved 2026-05-16 (commit pending). The trailing-slash bug is fixed: BuildUrl now appends a /-joined path segment only when the method's path is non-empty after trimming, so a method targeting the base URL itself produces https://host/api rather than https://host/api/. Regression tests Call_MethodWithEmptyPath_DoesNotAppendTrailingSlash and Call_MethodWithPath_BuildsExpectedUrl (asserting on the captured request URI) cover the empty-path and normal-path cases; the empty-path test was verified to fail before the fix.

Re-triage of the path-templating sub-issue ({id} placeholder substitution): this is a genuine design decision, not a code bug, and it requires editing the component design doc — both outside this module's edit scope (src/, tests/, this file only). The current code treats method paths as literal strings and routes parameters to the query string (GET/DELETE) or JSON body (POST/PUT); a method authored as /recipes/{id} sends the {id} token verbatim. Tracked follow-up / surfaced design question: the design owner must decide whether path templating is in scope — if yes, implement {name} substitution in BuildUrl and exclude substituted params from the query/body; if no, the Component-ExternalSystemGateway.md /recipes/{id} example must be changed to a literal path. The trailing-slash defect (the concrete correctness bug in this finding) is fully resolved.

ExternalSystemGateway-007 — External error response bodies are echoed verbatim into script-visible error messages

Severity Medium
Category Security
Status Resolved
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:167-177

Description

On a non-success HTTP response, the full response body is read into errorBody and embedded verbatim into the exception message ($"HTTP {code} from {name}: {errorBody}"), which then flows into ExternalCallResult.ErrorMessage and back to the calling script, and into Site Event Logging. An external system error page can be arbitrarily large (an HTML stack trace, a multi-megabyte body) and may contain sensitive detail. There is no size cap, so a hostile or misbehaving endpoint can inflate every error log entry and error string returned to scripts. There is also no content-type check before treating the body as text.

Recommendation

Truncate errorBody to a bounded length (e.g. 12 KB) before embedding it, and consider logging the full body separately at debug level rather than returning it to the script. Optionally only include the body when the content type is textual.

Resolution

Resolved 2026-05-16 (commit pending). InvokeHttpAsync now truncates the external error response body to MaxErrorBodyChars (2048) via a Truncate helper before embedding it into the transient/permanent exception message — so a misbehaving or hostile endpoint can no longer inflate every script-visible ErrorMessage and Site Event Logging entry with a multi-megabyte body. When truncation occurs the message is suffixed with … [truncated, N chars total] so the original size is still visible. Regression test Call_PermanentFailureWithHugeErrorBody_TruncatesErrorMessage drives a 400 with a 500 KB body and asserts the resulting ErrorMessage is bounded (< 4096 chars); it was verified to fail (500 040-char message) before the cap was added. Content-type filtering was considered optional in the recommendation and was not implemented — the size cap alone closes the inflation/disclosure vector.

ExternalSystemGateway-008 — Cancellation is conflated with transient timeout failure

Severity Medium — re-triaged: root cause already fixed in current source (see Resolution)
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.ExternalSystemGateway/ErrorClassifier.cs:24-30, src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:157-159

Description

ErrorClassifier.IsTransient(Exception) returns true for TaskCanceledException and OperationCanceledException. HttpClient.SendAsync throws TaskCanceledException both when its internal timeout elapses and when the supplied CancellationToken is cancelled (e.g. the Script Execution Actor is stopped, or the actor system is shutting down). Because InvokeHttpAsync's catch filter treats all of these as transient, a caller-initiated cancellation during a CachedCall will be misclassified as a transient failure and the message will be buffered for retry — work the caller explicitly asked to abandon. For a Call, a shutdown-time cancellation is reported to the script as a "Transient error" rather than an OperationCanceledException.

Recommendation

In InvokeHttpAsync, check cancellationToken.IsCancellationRequested first and rethrow OperationCanceledException (or let it propagate) before applying transient classification. Only treat a cancellation as a timeout when the supplied token is not the one that was cancelled.

Resolution

Resolved 2026-05-16 (commit pending). Re-triage: the root cause described — a caller-initiated cancellation being misclassified as a transient failure — is no longer present in the current source and is not reproducible. InvokeHttpAsync already wraps both SendAsync and the response-body ReadAsStringAsync in ordered catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) filters that rethrow the caller's cancellation before the catch (Exception ex) when (ErrorClassifier.IsTransient(ex)) branch is ever reached (this was added alongside the ExternalSystemGateway-002 timeout fix). A caller-cancel therefore propagates as OperationCanceledException and is never buffered; only the gateway's own timeout token reclassifies as transient.

ErrorClassifier.IsTransient(Exception) does still return true for TaskCanceledException/OperationCanceledException, but that is correct and intentional: a TaskCanceledException raised by an HTTP timeout is a genuine transient failure, and the only caller (InvokeHttpAsync:238) is unreachable for a caller-cancellation because the two preceding when-filtered catches intercept it first. The transient-vs-cancel decision is contextual (which token fired) and cannot be made from the exception type alone, which is exactly why the call site does it. No source change was required. A regression guard, CachedCall_CallerCancellation_IsNotBufferedAsTransient, was added: it cancels the caller token mid-CachedCall and asserts an OperationCanceledException is thrown and the S&F buffer remains empty (the cancelled work is not retried). The existing Call_CallerCancellation_IsNotMisreportedAsTimeout covers the synchronous Call path.

ExternalSystemGateway-009 — StoreAndForwardResult from EnqueueAsync is discarded; permanent failures during buffering are swallowed

Severity Medium — re-triaged: root cause subsumed by the ExternalSystemGateway-003 dispatch redesign (see Resolution)
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109-117

Description

CachedCallAsync assigns the result of _storeAndForward.EnqueueAsync(...) to sfResult and then never reads it — it unconditionally returns new ExternalCallResult(true, null, null, WasBuffered: true). EnqueueAsync can return Success == false (a permanent failure encountered during its immediate-delivery attempt — StoreAndForwardService.cs:142) or Buffered == false (delivered immediately). In both cases the ESG still reports the call as buffered and successful to the script. A permanent failure surfaced by the S&F immediate attempt is therefore silently lost instead of being returned to the script as the design requires ("On permanent failure (HTTP 4xx), the error is returned synchronously").

Recommendation

Inspect sfResult: if Success == false return an error ExternalCallResult; set WasBuffered from sfResult.Buffered rather than hard-coding true. (This finding is partly subsumed by the dispatch redesign in finding 003.)

Resolution

Resolved 2026-05-16 (commit pending). Re-triage: the stated root cause — "a permanent failure surfaced by EnqueueAsync's immediate-delivery attempt is silently lost" — can no longer occur in the current source, and the dead sfResult variable the finding cites has already been removed. The ExternalSystemGateway-003 fix changed CachedCallAsync to call EnqueueAsync with attemptImmediateDelivery: false. With that flag, EnqueueAsync never invokes the registered delivery handler: it skips the immediate-delivery block entirely (so the StoreAndForwardResult(false, …, …) permanent-failure return at StoreAndForwardService.cs:147 is unreachable from this caller) and unconditionally buffers, returning Accepted: true, WasBuffered: true (StoreAndForwardService.cs:180). The ExternalCallResult(true, null, null, WasBuffered: true) that CachedCallAsync returns is therefore now factually correct in every reachable case — the message is buffered and there is no swallowed permanent failure. Permanent HTTP 4xx failures are still surfaced synchronously, because CachedCallAsync makes its own first HTTP attempt and catches PermanentExternalSystemException before it ever reaches EnqueueAsync. No source change was required beyond the ExternalSystemGateway-003 redesign that already landed. Coverage: CachedCall_TransientFailure_BuffersWithSystemRetrySettings asserts both result.WasBuffered == true and that the message is genuinely present in the S&F buffer (depth == 1), confirming the WasBuffered: true claim is not a lie; the existing CachedCall permanent-failure path is exercised by Call_Permanent400_ReturnsPermanentError semantics shared via InvokeHttpAsync.

ExternalSystemGateway-010 — GetConnectionAsync leaks the SqlConnection when OpenAsync fails

Severity Medium
Category Performance & resource management
Status Resolved
Location src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:48-50

Description

GetConnectionAsync constructs new SqlConnection(...) and calls await connection.OpenAsync(...). If OpenAsync throws (unreachable server, bad credentials, cancellation) the just-created SqlConnection instance is never disposed — the exception propagates and the local reference is lost. While an unopened SqlConnection is lightweight, over many failing calls this is an avoidable leak. The design doc says Database.Connection() failures return an error to the script; the current code lets a raw SqlException escape, which is acceptable, but the leak is not.

Recommendation

Wrap the open in a try/catch that disposes the connection before rethrowing: try { await connection.OpenAsync(ct); } catch { connection.Dispose(); throw; }.

Resolution

Resolved 2026-05-16 (commit pending). GetConnectionAsync now wraps OpenAsync in a try/catch that calls await connection.DisposeAsync() before rethrowing, so a failed open (unreachable server, bad credentials, cancellation) no longer leaks the SqlConnection. Connection creation was extracted into an internal virtual CreateConnection(string) factory so the failure path is unit-testable. Regression test GetConnection_OpenFails_DisposesConnectionBeforeRethrowing substitutes a DbConnection whose OpenAsync always throws and asserts the connection is disposed when the exception propagates; it was verified to fail before the try/catch was added.

ExternalSystemGateway-011 — Every call performs a full repository scan of all systems and methods

Severity Low
Category Performance & resource management
Status Open
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:360-374, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:169-176

Description

ResolveSystemAndMethodAsync calls GetAllExternalSystemsAsync() and then GetMethodsByExternalSystemIdAsync() and filters in memory on every single call; ResolveConnectionAsync calls GetAllDatabaseConnectionsAsync() and filters in memory on every cached write / connection request. At sites this hits the SQLite repository, and SiteExternalSystemRepository re-reads and re-parses the methods JSON each time. For a hot script path this is unnecessary repeated I/O and allocation. Definitions only change on deployment, so they are eminently cacheable.

Recommendation

Add an in-memory cache of system/method/connection definitions keyed by name, invalidated on artifact deployment. Alternatively use a name-keyed repository lookup rather than fetch-all-then-filter.

Resolution

2026-05-16 — Root cause confirmed, but left Open: no correct fix is possible within this module's edit scope. ResolveSystemAndMethodAsync (ExternalSystemClient.cs:360) does call GetAllExternalSystemsAsync() followed by GetMethodsByExternalSystemIdAsync() and filters in memory, and ResolveConnectionAsync (DatabaseGateway.cs:169) does GetAllDatabaseConnectionsAsync() then filters — fetch-all-then-filter on every hot-path call, as described.

Both recommended fixes require changes outside src/ScadaLink.ExternalSystemGateway: (a) a name-keyed repository lookup (e.g. GetExternalSystemByNameAsync) means adding methods to IExternalSystemRepository in ScadaLink.Commons and implementing them in ScadaLink.ConfigurationDatabase / ScadaLink.SiteRuntime; (b) an in-memory cache invalidated on artifact deployment requires subscribing to a deployment-applied event owned by ScadaLink.SiteRuntime / ScadaLink.DeploymentManager. A purely module-local cache with a time-based TTL was rejected as a fix: definitions only change on deployment and must reflect a deployment promptly, so a TTL would either be too short to help the hot path or long enough to serve stale definitions after a redeploy — trading a correctness hazard for a performance gain on a Low-severity issue. Tracked follow-up: add a name-keyed lookup to IExternalSystemRepository (Commons) and have the gateway use it, or add a deployment-invalidated definition cache wired from SiteRuntime. No source change was made in this module.

ExternalSystemGateway-012 — Permanent-failure logging requirement is not met; _logger is injected but unused

Severity Low
Category Design-document adherence
Status Resolved
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:24,169-177, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:22

Description

The design doc states permanent failures are "Logged to Site Event Logging", but InvokeHttpAsync performs no logging on the permanent-failure path. In fact the injected ILogger<ExternalSystemClient> and ILogger<DatabaseGateway> fields are never used at all in either class. Either the logging is expected to happen in the caller (Script Execution Actor) — in which case the design doc is imprecise about where — or it is missing. Separately, IsTransient(HttpStatusCode) treats any non-success, non-(5xx/408/429) status as permanent without an explicit comment, which is a reasonable default but undocumented.

Recommendation

Add a _logger.LogWarning on the permanent-failure path (and a debug log on transient), or clarify in the design doc that Site Event Logging capture is the caller's responsibility and remove the unused _logger fields. Add a comment in ErrorClassifier documenting the "default to permanent" behaviour.

Resolution

Resolved 2026-05-16 (commit pending). Partial re-triage of the finding text: the "_logger injected but unused" claim is stale — both loggers are already used (the DeliverBufferedAsync retry-sweep handlers added by ExternalSystemGateway-001 log park/error events). The genuine remaining gap — InvokeHttpAsync performing no logging on the HTTP-failure paths — is now fixed: InvokeHttpAsync emits a _logger.LogWarning on the permanent-failure path (status code, system, method, truncated error body) so a permanent failure is visible in Site Event Logging as the design requires, and a _logger.LogDebug on the transient-failure path (transient failures are normal retry/S&F operation and must not be noisy at warning level). A documentation comment was added to ErrorClassifier.IsTransient(HttpStatusCode) explaining the "any other non-success status defaults to permanent" behaviour and why permanent is the safe default. Regression tests: Call_PermanentFailure_LogsAWarning (asserts a warning carrying the system name is emitted; verified to fail before the LogWarning was added) and Call_TransientFailure_DoesNotLogAtWarningOrAbove (guards against over-logging transient failures).

ExternalSystemGateway-013 — MaxConcurrentConnectionsPerSystem and DefaultHttpTimeout options are defined but never used

Severity Low
Category Code organization & conventions
Status Resolved
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemGatewayOptions.cs:9,12, src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13

Description

ExternalSystemGatewayOptions.MaxConcurrentConnectionsPerSystem (default 10) and DefaultHttpTimeout (default 30s) are bound from configuration but neither is read anywhere. AddHttpClient() registers the default factory with no ConfigurePrimaryHttpMessageHandler/SocketsHttpHandler MaxConnectionsPerServer and no Timeout, so both options have no effect. An operator setting these values gets them silently ignored — a misleading configuration surface (DefaultHttpTimeout is also referenced by finding 002).

Recommendation

Either wire the options into a named/typed HttpClient registration (set MaxConnectionsPerServer on the primary handler, set Timeout), or remove the unused options to avoid implying behaviour that does not exist.

Resolution

Resolved 2026-05-16 (commit pending). Partial re-triage of the finding text: DefaultHttpTimeout is no longer unused — it became the effective per-call HTTP round-trip limit when ExternalSystemGateway-002 was fixed (InvokeHttpAsync builds a linked CancellationTokenSource(DefaultHttpTimeout)). The genuinely-unused option, MaxConcurrentConnectionsPerSystem, is now wired in: AddExternalSystemGateway adds a ConfigureHttpClientDefaults registration that supplies a SocketsHttpHandler whose MaxConnectionsPerServer is bound from the option, so it applies to every per-system named client (ExternalSystem_{name}) the gateway creates rather than being silently ignored. Regression test ServiceWiringTests.MaxConcurrentConnectionsPerSystem_IsAppliedToTheNamedHttpClientPrimaryHandler builds the DI container, resolves the named client's handler chain via IHttpMessageHandlerFactory, walks to the primary handler and asserts SocketsHttpHandler.MaxConnectionsPerServer equals the configured value; it was verified to fail before the wiring was added.

ExternalSystemGateway-014 — Cached-call buffering path and DatabaseGateway are untested

Severity Low
Category Testing coverage
Status Resolved
Location tests/ScadaLink.ExternalSystemGateway.Tests/ExternalSystemClientTests.cs:1, tests/ScadaLink.ExternalSystemGateway.Tests/DatabaseGatewayTests.cs

Description

ExternalSystemClientTests covers system/method not-found, success, transient 500 and permanent 400 for CallAsync, plus CachedCall not-found and success. It does not cover: the CachedCall transient-failure → S&F buffering branch (the most behaviour-rich path, including the _storeAndForward == null fallback and WasBuffered semantics), the CachedCall permanent-failure branch, connection-exception classification (HttpRequestException thrown by the handler), BuildUrl query-string construction, and ApplyAuth for the apikey/basic variants. There is no test file for DatabaseGateway at all — GetConnectionAsync not-found, CachedWriteAsync not-found, and the _storeAndForward == null guard are entirely uncovered. The MockHttpMessageHandler also does not assert request URL/headers/body, so auth and URL construction are unverified.

Recommendation

Add tests for the CachedCall transient/buffering paths (with a substituted S&F service), DatabaseGateway not-found and null-S&F guards, and BuildUrl/ApplyAuth by asserting on the captured HttpRequestMessage in the mock handler.

Resolution

Resolved 2026-05-16 (commit pending). Partial re-triage of the finding text: a number of the listed gaps were already closed by tests added when findings 001010 were fixed — DatabaseGatewayTests.cs now exists (not-found, null-S&F guard, buffered-write delivery), and the CachedCall transient/buffering and permanent paths are covered by the ExternalSystemGateway-001/003/004/008 regression tests. The remaining genuine coverage gaps are now closed with new tests:

  • Call_GetWithParameters_AppendsEscapedQueryStringBuildUrl GET query-string construction, asserting on the captured request URI and that an & inside a value is percent-escaped.
  • Call_PostWithParameters_SendsJsonBody — POST JSON-body construction.
  • Call_ApiKeyAuthWithDefaultHeader_SendsXApiKeyHeader, Call_ApiKeyAuthWithCustomHeader_SendsNamedHeader, Call_BasicAuth_SendsBase64AuthorizationHeaderApplyAuth for all three auth variants, asserting on the captured request headers.
  • Call_ConnectionError_IsClassifiedAsTransient — a connection-level HttpRequestException is classified transient.
  • CachedWrite_BuffersTheWriteWithConnectionRetrySettings and CachedWrite_ZeroMaxRetriesIsHonouredNotTreatedAsUnset — the DatabaseGateway CachedWrite happy-path enqueue against a real S&F service.

The shared RequestCapturingHandler test helper was extended to capture request headers and body so URL/auth/body construction is now verified, not just status codes. These are new-coverage tests against already-correct behaviour, so they pass on the current source; the BuildUrl and ApplyAuth paths they exercise are now protected against regression.