27 KiB
Code Review — ExternalSystemGateway
| Field | Value |
|---|---|
| Module | src/ScadaLink.ExternalSystemGateway |
| Design doc | docs/requirements/Component-ExternalSystemGateway.md |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | 9c60592 |
| Open findings | 11 |
Summary
The External System Gateway is a small module (five source files plus options) that
implements the HTTP/REST client (ExternalSystemClient), the database access surface
(DatabaseGateway), and error classification (ErrorClassifier). The structure is
clean and the dual call-mode semantics broadly match the design doc. However, the
review surfaced several substantive problems that prevent the module from behaving as
designed. The most serious is that no store-and-forward delivery handler is ever
registered for the ExternalSystem or CachedDbWrite categories, so cached calls
and cached writes are buffered but can never actually be delivered on retry — a silent
data-loss path. Two further high-impact issues are that the per-system call timeout
is never applied to the HTTP client (the design's central error-handling guarantee
is absent), and that CachedCall double-dispatches the HTTP request because
StoreAndForwardService.EnqueueAsync itself re-attempts immediate delivery, breaking
the idempotency expectations. A cluster of medium issues concern resource leaks,
classification gaps (cancellation conflation), and the dropped StoreAndForwardResult.
Test coverage is thin — CachedCall transient/buffering paths and DatabaseGateway
are entirely untested. Themes: incomplete wiring against the S&F engine, and design-doc
requirements (timeout, retry settings) that are declared but not implemented.
Checklist coverage
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | ☑ | URL building edge cases, dropped S&F result, classification gaps — findings 003, 006, 009. |
| 2 | Akka.NET conventions | ☑ | No actors in this module; AddExternalSystemGatewayActors is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found in this module. |
| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; ExternalCallResult.Response lazy-parse is not thread-safe but instances are single-use. No findings raised. |
| 4 | Error handling & resilience | ☑ | S&F handler never registered, double-dispatch, timeout not applied, cancellation conflation — findings 001, 002, 003, 008. |
| 5 | Security | ☑ | Auth secrets logged-safe, but error bodies echoed verbatim — finding 007. |
| 6 | Performance & resource management | ☑ | HttpRequestMessage/HttpResponseMessage and failed SqlConnection not disposed; full repository scan per call — findings 005, 010, 011. |
| 7 | Design-document adherence | ☑ | Timeout, retry settings, audit logging gaps — findings 002, 004, 012. |
| 8 | Code organization & conventions | ☑ | Options class correctly owned by module; MaxConcurrentConnectionsPerSystem unused — finding 013. |
| 9 | Testing coverage | ☑ | CachedCall buffering and DatabaseGateway untested — finding 014. |
| 10 | Documentation & comments | ☑ | XML docs reference WP numbers; permanent-failure logging requirement unverified — folded into finding 012. |
Findings
ExternalSystemGateway-001 — No S&F delivery handler registered; cached calls and writes can never be delivered
| Severity | Critical |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:81 |
Description
CachedCallAsync and CachedWriteAsync enqueue messages under
StoreAndForwardCategory.ExternalSystem and StoreAndForwardCategory.CachedDbWrite.
StoreAndForwardService.RegisterDeliveryHandler is the only mechanism that lets the
S&F engine actually deliver a buffered message, and a repository-wide search shows it
is never called for either category anywhere in the codebase. Consequences:
- On a transient failure,
EnqueueAsyncfalls through to the "No handler registered — buffer for later" branch (StoreAndForwardService.cs:163) and the message is persisted. - During the retry sweep,
AttemptDeliveryAsync(StoreAndForwardService.cs:201) logs"No delivery handler for category {Category}"and returns without ever removing or delivering the message.
The result is that every cached external call and cached DB write is silently
buffered forever and never delivered — a data-loss path for the exact "deferred
delivery is acceptable" use cases the design doc calls out (posting production data,
quality reports). The script also receives WasBuffered: true / a successful
CachedWriteAsync completion, so the failure is completely invisible.
Recommendation
Register delivery handlers for StoreAndForwardCategory.ExternalSystem and
StoreAndForwardCategory.CachedDbWrite during host/site startup. The ExternalSystem
handler should deserialize the payload, re-resolve the system/method, and re-invoke
InvokeHttpAsync, returning true/false/throwing per the transient-vs-permanent
contract EnqueueAsync expects. The CachedDbWrite handler should execute the SQL
against the named connection. Add an integration test that buffers a message and
verifies it is delivered by a retry sweep.
Resolution
Resolved 2026-05-16. Delivery handlers for StoreAndForwardCategory.ExternalSystem and
CachedDbWrite are now registered at site startup in AkkaHostedService, after
StoreAndForwardService.StartAsync(). Each handler resolves its consumer in a fresh DI
scope and calls a new DeliverBufferedAsync: ExternalSystemClient.DeliverBufferedAsync
re-resolves the system/method and re-invokes InvokeHttpAsync, and
DatabaseGateway.DeliverBufferedAsync executes the buffered SQL — each returning true
on success, false (park) when the target no longer exists or fails permanently, and
throwing on transient failure so the engine retries. EnqueueAsync gained an
attemptImmediateDelivery parameter; CachedCallAsync passes false so registering the
handler does not dispatch the request twice (the double-dispatch noted in
ExternalSystemGateway-003). Regression tests cover the success, target-removed and
transient-retry paths. Fixed by the commit whose message references
ExternalSystemGateway-001.
ExternalSystemGateway-002 — Per-system call timeout is never applied to HTTP requests
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:130, src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13 |
Description
The design doc states each external system definition specifies a timeout that
"applies to all method calls on that system" and "applies to the HTTP request
round-trip", and ExternalSystemGatewayOptions.DefaultHttpTimeout exists as a
fallback. In practice no timeout is ever configured. ServiceCollectionExtensions
calls services.AddHttpClient() with no per-named-client configuration, and
InvokeHttpAsync calls _httpClientFactory.CreateClient($"ExternalSystem_{system.Name}")
without setting client.Timeout or passing a CancellationToken derived from a
timeout. SendAsync is therefore subject only to HttpClient's default 100-second
timeout, regardless of the system definition or the configured DefaultHttpTimeout.
A slow or hung external system will block the calling Script Execution Actor far
longer than the operator configured, and the design's core error-handling guarantee
(timeout → transient classification) does not hold within the intended window.
There is also no Timeout field on ExternalSystemDefinition at all, so even a
correct implementation has nowhere to read the per-system value from — the entity is
missing the field the design requires.
Recommendation
Add a Timeout (TimeSpan) field to ExternalSystemDefinition and have
InvokeHttpAsync enforce it — either by setting client.Timeout via a typed/named
HttpClient registration, or by linking a CancellationTokenSource with the
per-system (or DefaultHttpTimeout) timeout to the supplied cancellationToken
before SendAsync. Ensure the resulting TaskCanceledException/TimeoutException
is classified as transient.
Resolution
Resolved 2026-05-16 (commit <pending>). InvokeHttpAsync now enforces a call
timeout: ExternalSystemClient takes an IOptions<ExternalSystemGatewayOptions> and
links a CancellationTokenSource(DefaultHttpTimeout) with the caller's token before
SendAsync and the response-body read, so the design's "timeout applies to the HTTP
request round-trip" guarantee now holds within the configured window (default 30s)
instead of HttpClient's default 100s. A timeout is reclassified as a
TransientExternalSystemException; a caller-initiated cancellation is distinguished
from a timeout and propagated as OperationCanceledException rather than being
swallowed as transient. Regression tests:
Call_SlowSystem_TimesOutAsTransientErrorWithinConfiguredWindow and
Call_CallerCancellation_IsNotMisreportedAsTimeout.
Note (partial scope): the per-system Timeout field on ExternalSystemDefinition
remains unimplemented — adding it requires a change to ScadaLink.Commons, which is
outside this module's edit scope. Until that entity field exists, the configured
DefaultHttpTimeout is the effective per-call limit for every system. A follow-up
against the Commons module should add the Timeout field and have InvokeHttpAsync
prefer it over the default. This is a tracked follow-up, not a regression.
ExternalSystemGateway-003 — CachedCall double-dispatches the HTTP request
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:84-117 |
Description
CachedCallAsync first calls InvokeHttpAsync directly (line 86). On a
TransientExternalSystemException it then calls _storeAndForward.EnqueueAsync(...)
(line 109). StoreAndForwardService.EnqueueAsync is not a pure enqueue — it
"Attempts immediate delivery" by invoking the registered delivery handler
(StoreAndForwardService.cs:128-159). If a delivery handler for the ExternalSystem
category is registered (as finding 001 recommends), the HTTP request will be executed
a second time synchronously inside EnqueueAsync, immediately after the first
attempt failed. For a transient failure that is actually a slow/overloaded system,
this doubles the load and — critically — if the original request did reach the
external system, the immediate retry produces a duplicate delivery before the script
even returns, worsening the idempotency hazard the design doc explicitly warns about.
Recommendation
Decide on one dispatch path. Either (a) have CachedCall not pre-invoke
InvokeHttpAsync and instead let EnqueueAsync's immediate-delivery attempt be the
single first attempt (requires the handler to exist and to surface permanent vs
transient correctly); or (b) add an enqueue-only entry point to
StoreAndForwardService that skips the immediate-delivery attempt, and have
CachedCall use it after its own first attempt. Approach (a) is cleaner and removes
the duplicated logic.
Resolution
Resolved 2026-05-16 (commit <pending>). Re-triage: this finding was already fixed in
the codebase as a side effect of the ExternalSystemGateway-001 fix and is no longer
reproducible against the current source. StoreAndForwardService.EnqueueAsync gained an
attemptImmediateDelivery parameter (recommendation approach (b)), and
CachedCallAsync passes attemptImmediateDelivery: false after its own first HTTP
attempt — so EnqueueAsync buffers the message for the background retry sweep without
re-invoking the registered delivery handler, eliminating the duplicate dispatch. A
dedicated regression test, CachedCall_TransientFailure_DoesNotImmediatelyRedispatchViaRegisteredHandler,
was added in this module's test suite: it registers a counting delivery handler, drives
a CachedCall whose HTTP attempt fails transiently, and asserts the handler is invoked
zero times during enqueue. The test was verified to fail if attemptImmediateDelivery
is flipped back to true.
ExternalSystemGateway-004 — System retry settings are not honoured for cached calls/writes
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:114-115, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:86-87 |
Description
CachedCallAsync and CachedWriteAsync pass the definition's MaxRetries /
RetryDelay to EnqueueAsync only when they are non-default
(MaxRetries > 0 ? ... : null, RetryDelay > TimeSpan.Zero ? ... : null), otherwise
falling back to the S&F defaults. The site-side repository that supplies these
definitions, SiteExternalSystemRepository.MapExternalSystem
(src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs:194), never
reads MaxRetries/RetryDelay from SQLite at all — the constructed entities always
have MaxRetries == 0 and RetryDelay == TimeSpan.Zero. As a result, at sites the
per-system retry settings the design doc requires are always discarded and the
global S&F defaults are silently used instead. The > 0 guard in the ESG also makes
a legitimately-configured MaxRetries of 0 ("never retry") indistinguishable from
"unset", so an operator cannot express "do not retry".
Recommendation
Within this module, drop the > 0 / > Zero guards and pass the definition values
through directly (or use nullable fields on the entity to distinguish "unset"). The
companion fix in SiteExternalSystemRepository to actually map the retry columns
should be tracked against the SiteRuntime module.
Resolution
Unresolved.
ExternalSystemGateway-005 — HttpRequestMessage and HttpResponseMessage are not disposed
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:133-167 |
Description
InvokeHttpAsync creates an HttpRequestMessage (line 133) and receives an
HttpResponseMessage from SendAsync (line 155); neither is wrapped in a using nor
explicitly disposed. Both are IDisposable and own resources (the request's
StringContent, the response's content stream). Under the per-invocation call volume
of a busy site this produces avoidable pressure on the finalizer queue and can hold
socket/stream resources longer than necessary. The success path reads the content but
never disposes the response; the error path likewise reads errorBody and then throws
without disposing.
Recommendation
Wrap the request in using var request = ... and the response in
using var response = ... (or call Dispose() in a finally). Ensure disposal still
occurs on the exception paths.
Resolution
Unresolved.
ExternalSystemGateway-006 — BuildUrl ignores path templates and appends a trailing slash for empty paths
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:180-196 |
Description
BuildUrl does baseUrl.TrimEnd('/') + "/" + path.TrimStart('/'). When method.Path
is empty (a method that targets the base URL itself), this still appends a /,
producing https://host/api/ which some servers treat as a different resource than
https://host/api. More importantly, the design doc shows method paths as templates
like /recipes/{id}, but BuildUrl performs no placeholder substitution — a {id}
token is sent literally in the URL and the corresponding parameter is instead appended
as a query-string entry (for GET/DELETE) or placed in the JSON body (POST/PUT). Either
the design's path-template feature is unimplemented, or the doc is stale; in the
current code a method defined as /recipes/{id} will never produce a correct URL.
Recommendation
Decide whether path templating is in scope. If yes, implement {name} substitution
from parameters in BuildUrl and exclude substituted parameters from the query
string/body. If no, update the component design doc to remove the /recipes/{id}
example and state that paths are literal. Also avoid appending a trailing / when
path is empty.
Resolution
Unresolved.
ExternalSystemGateway-007 — External error response bodies are echoed verbatim into script-visible error messages
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:167-177 |
Description
On a non-success HTTP response, the full response body is read into errorBody and
embedded verbatim into the exception message ($"HTTP {code} from {name}: {errorBody}"),
which then flows into ExternalCallResult.ErrorMessage and back to the calling script,
and into Site Event Logging. An external system error page can be arbitrarily large
(an HTML stack trace, a multi-megabyte body) and may contain sensitive detail. There
is no size cap, so a hostile or misbehaving endpoint can inflate every error log entry
and error string returned to scripts. There is also no content-type check before
treating the body as text.
Recommendation
Truncate errorBody to a bounded length (e.g. 1–2 KB) before embedding it, and
consider logging the full body separately at debug level rather than returning it to
the script. Optionally only include the body when the content type is textual.
Resolution
Unresolved.
ExternalSystemGateway-008 — Cancellation is conflated with transient timeout failure
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | src/ScadaLink.ExternalSystemGateway/ErrorClassifier.cs:24-30, src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:157-159 |
Description
ErrorClassifier.IsTransient(Exception) returns true for TaskCanceledException
and OperationCanceledException. HttpClient.SendAsync throws TaskCanceledException
both when its internal timeout elapses and when the supplied CancellationToken is
cancelled (e.g. the Script Execution Actor is stopped, or the actor system is shutting
down). Because InvokeHttpAsync's catch filter treats all of these as transient, a
caller-initiated cancellation during a CachedCall will be misclassified as a
transient failure and the message will be buffered for retry — work the caller
explicitly asked to abandon. For a Call, a shutdown-time cancellation is reported to
the script as a "Transient error" rather than an OperationCanceledException.
Recommendation
In InvokeHttpAsync, check cancellationToken.IsCancellationRequested first and
rethrow OperationCanceledException (or let it propagate) before applying transient
classification. Only treat a cancellation as a timeout when the supplied token is
not the one that was cancelled.
Resolution
Unresolved.
ExternalSystemGateway-009 — StoreAndForwardResult from EnqueueAsync is discarded; permanent failures during buffering are swallowed
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109-117 |
Description
CachedCallAsync assigns the result of _storeAndForward.EnqueueAsync(...) to
sfResult and then never reads it — it unconditionally returns
new ExternalCallResult(true, null, null, WasBuffered: true). EnqueueAsync can
return Success == false (a permanent failure encountered during its
immediate-delivery attempt — StoreAndForwardService.cs:142) or Buffered == false
(delivered immediately). In both cases the ESG still reports the call as buffered and
successful to the script. A permanent failure surfaced by the S&F immediate attempt is
therefore silently lost instead of being returned to the script as the design requires
("On permanent failure (HTTP 4xx), the error is returned synchronously").
Recommendation
Inspect sfResult: if Success == false return an error ExternalCallResult; set
WasBuffered from sfResult.Buffered rather than hard-coding true. (This finding is
partly subsumed by the dispatch redesign in finding 003.)
Resolution
Unresolved.
ExternalSystemGateway-010 — GetConnectionAsync leaks the SqlConnection when OpenAsync fails
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:48-50 |
Description
GetConnectionAsync constructs new SqlConnection(...) and calls await connection.OpenAsync(...). If OpenAsync throws (unreachable server, bad
credentials, cancellation) the just-created SqlConnection instance is never disposed
— the exception propagates and the local reference is lost. While an unopened
SqlConnection is lightweight, over many failing calls this is an avoidable leak. The
design doc says Database.Connection() failures return an error to the script; the
current code lets a raw SqlException escape, which is acceptable, but the leak is
not.
Recommendation
Wrap the open in a try/catch that disposes the connection before rethrowing:
try { await connection.OpenAsync(ct); } catch { connection.Dispose(); throw; }.
Resolution
Unresolved.
ExternalSystemGateway-011 — Every call performs a full repository scan of all systems and methods
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:231-245, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:90-97 |
Description
ResolveSystemAndMethodAsync calls GetAllExternalSystemsAsync() and then
GetMethodsByExternalSystemIdAsync() and filters in memory on every single call;
ResolveConnectionAsync calls GetAllDatabaseConnectionsAsync() and filters in memory
on every cached write / connection request. At sites this hits the SQLite repository,
and SiteExternalSystemRepository re-reads and re-parses the methods JSON each time.
For a hot script path this is unnecessary repeated I/O and allocation. Definitions only
change on deployment, so they are eminently cacheable.
Recommendation
Add an in-memory cache of system/method/connection definitions keyed by name, invalidated on artifact deployment. Alternatively use a name-keyed repository lookup rather than fetch-all-then-filter.
Resolution
Unresolved.
ExternalSystemGateway-012 — Permanent-failure logging requirement is not met; _logger is injected but unused
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:24,169-177, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:22 |
Description
The design doc states permanent failures are "Logged to Site Event Logging", but
InvokeHttpAsync performs no logging on the permanent-failure path. In fact the
injected ILogger<ExternalSystemClient> and ILogger<DatabaseGateway> fields are
never used at all in either class. Either the logging is expected to happen in the
caller (Script Execution Actor) — in which case the design doc is imprecise about
where — or it is missing. Separately, IsTransient(HttpStatusCode) treats any
non-success, non-(5xx/408/429) status as permanent without an explicit comment, which
is a reasonable default but undocumented.
Recommendation
Add a _logger.LogWarning on the permanent-failure path (and a debug log on
transient), or clarify in the design doc that Site Event Logging capture is the
caller's responsibility and remove the unused _logger fields. Add a comment in
ErrorClassifier documenting the "default to permanent" behaviour.
Resolution
Unresolved.
ExternalSystemGateway-013 — MaxConcurrentConnectionsPerSystem and DefaultHttpTimeout options are defined but never used
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemGatewayOptions.cs:9,12, src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13 |
Description
ExternalSystemGatewayOptions.MaxConcurrentConnectionsPerSystem (default 10) and
DefaultHttpTimeout (default 30s) are bound from configuration but neither is read
anywhere. AddHttpClient() registers the default factory with no
ConfigurePrimaryHttpMessageHandler/SocketsHttpHandler MaxConnectionsPerServer and
no Timeout, so both options have no effect. An operator setting these values gets
them silently ignored — a misleading configuration surface (DefaultHttpTimeout is
also referenced by finding 002).
Recommendation
Either wire the options into a named/typed HttpClient registration (set
MaxConnectionsPerServer on the primary handler, set Timeout), or remove the unused
options to avoid implying behaviour that does not exist.
Resolution
Unresolved.
ExternalSystemGateway-014 — Cached-call buffering path and DatabaseGateway are untested
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | tests/ScadaLink.ExternalSystemGateway.Tests/ExternalSystemClientTests.cs:1, (no DatabaseGatewayTests.cs) |
Description
ExternalSystemClientTests covers system/method not-found, success, transient 500 and
permanent 400 for CallAsync, plus CachedCall not-found and success. It does not
cover: the CachedCall transient-failure → S&F buffering branch (the most
behaviour-rich path, including the _storeAndForward == null fallback and WasBuffered
semantics), the CachedCall permanent-failure branch, connection-exception
classification (HttpRequestException thrown by the handler), BuildUrl query-string
construction, and ApplyAuth for the apikey/basic variants. There is no test file
for DatabaseGateway at all — GetConnectionAsync not-found, CachedWriteAsync
not-found, and the _storeAndForward == null guard are entirely uncovered. The
MockHttpMessageHandler also does not assert request URL/headers/body, so auth and
URL construction are unverified.
Recommendation
Add tests for the CachedCall transient/buffering paths (with a substituted S&F
service), DatabaseGateway not-found and null-S&F guards, and BuildUrl/ApplyAuth
by asserting on the captured HttpRequestMessage in the mock handler.
Resolution
Unresolved.