Files
scadalink-design/code-reviews/ExternalSystemGateway/findings.md
Joseph Doherty 977d7369a7 docs: add code review process and baseline review of all 19 modules
Establishes a per-module code review workflow under code-reviews/ and
records the 2026-05-16 baseline review (commit 9c60592): 241 findings
across all src/ modules (6 Critical, 46 High, 100 Medium, 89 Low).
This is the clean starting point for remediation work.
2026-05-16 18:09:09 -04:00

23 KiB
Raw Blame History

Code Review — ExternalSystemGateway

Field Value
Module src/ScadaLink.ExternalSystemGateway
Design doc docs/requirements/Component-ExternalSystemGateway.md
Status Reviewed
Last reviewed 2026-05-16
Reviewer claude-agent
Commit reviewed 9c60592
Open findings 14

Summary

The External System Gateway is a small module (five source files plus options) that implements the HTTP/REST client (ExternalSystemClient), the database access surface (DatabaseGateway), and error classification (ErrorClassifier). The structure is clean and the dual call-mode semantics broadly match the design doc. However, the review surfaced several substantive problems that prevent the module from behaving as designed. The most serious is that no store-and-forward delivery handler is ever registered for the ExternalSystem or CachedDbWrite categories, so cached calls and cached writes are buffered but can never actually be delivered on retry — a silent data-loss path. Two further high-impact issues are that the per-system call timeout is never applied to the HTTP client (the design's central error-handling guarantee is absent), and that CachedCall double-dispatches the HTTP request because StoreAndForwardService.EnqueueAsync itself re-attempts immediate delivery, breaking the idempotency expectations. A cluster of medium issues concern resource leaks, classification gaps (cancellation conflation), and the dropped StoreAndForwardResult. Test coverage is thin — CachedCall transient/buffering paths and DatabaseGateway are entirely untested. Themes: incomplete wiring against the S&F engine, and design-doc requirements (timeout, retry settings) that are declared but not implemented.

Checklist coverage

# Category Examined Notes
1 Correctness & logic bugs URL building edge cases, dropped S&F result, classification gaps — findings 003, 006, 009.
2 Akka.NET conventions No actors in this module; AddExternalSystemGatewayActors is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found in this module.
3 Concurrency & thread safety Services are stateless and DI-scoped; ExternalCallResult.Response lazy-parse is not thread-safe but instances are single-use. No findings raised.
4 Error handling & resilience S&F handler never registered, double-dispatch, timeout not applied, cancellation conflation — findings 001, 002, 003, 008.
5 Security Auth secrets logged-safe, but error bodies echoed verbatim — finding 007.
6 Performance & resource management HttpRequestMessage/HttpResponseMessage and failed SqlConnection not disposed; full repository scan per call — findings 005, 010, 011.
7 Design-document adherence Timeout, retry settings, audit logging gaps — findings 002, 004, 012.
8 Code organization & conventions Options class correctly owned by module; MaxConcurrentConnectionsPerSystem unused — finding 013.
9 Testing coverage CachedCall buffering and DatabaseGateway untested — finding 014.
10 Documentation & comments XML docs reference WP numbers; permanent-failure logging requirement unverified — folded into finding 012.

Findings

ExternalSystemGateway-001 — No S&F delivery handler registered; cached calls and writes can never be delivered

Severity Critical
Category Error handling & resilience
Status Open
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:81

Description

CachedCallAsync and CachedWriteAsync enqueue messages under StoreAndForwardCategory.ExternalSystem and StoreAndForwardCategory.CachedDbWrite. StoreAndForwardService.RegisterDeliveryHandler is the only mechanism that lets the S&F engine actually deliver a buffered message, and a repository-wide search shows it is never called for either category anywhere in the codebase. Consequences:

  1. On a transient failure, EnqueueAsync falls through to the "No handler registered — buffer for later" branch (StoreAndForwardService.cs:163) and the message is persisted.
  2. During the retry sweep, AttemptDeliveryAsync (StoreAndForwardService.cs:201) logs "No delivery handler for category {Category}" and returns without ever removing or delivering the message.

The result is that every cached external call and cached DB write is silently buffered forever and never delivered — a data-loss path for the exact "deferred delivery is acceptable" use cases the design doc calls out (posting production data, quality reports). The script also receives WasBuffered: true / a successful CachedWriteAsync completion, so the failure is completely invisible.

Recommendation

Register delivery handlers for StoreAndForwardCategory.ExternalSystem and StoreAndForwardCategory.CachedDbWrite during host/site startup. The ExternalSystem handler should deserialize the payload, re-resolve the system/method, and re-invoke InvokeHttpAsync, returning true/false/throwing per the transient-vs-permanent contract EnqueueAsync expects. The CachedDbWrite handler should execute the SQL against the named connection. Add an integration test that buffers a message and verifies it is delivered by a retry sweep.

Resolution

Unresolved.

ExternalSystemGateway-002 — Per-system call timeout is never applied to HTTP requests

Severity High
Category Error handling & resilience
Status Open
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:130, src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13

Description

The design doc states each external system definition specifies a timeout that "applies to all method calls on that system" and "applies to the HTTP request round-trip", and ExternalSystemGatewayOptions.DefaultHttpTimeout exists as a fallback. In practice no timeout is ever configured. ServiceCollectionExtensions calls services.AddHttpClient() with no per-named-client configuration, and InvokeHttpAsync calls _httpClientFactory.CreateClient($"ExternalSystem_{system.Name}") without setting client.Timeout or passing a CancellationToken derived from a timeout. SendAsync is therefore subject only to HttpClient's default 100-second timeout, regardless of the system definition or the configured DefaultHttpTimeout. A slow or hung external system will block the calling Script Execution Actor far longer than the operator configured, and the design's core error-handling guarantee (timeout → transient classification) does not hold within the intended window.

There is also no Timeout field on ExternalSystemDefinition at all, so even a correct implementation has nowhere to read the per-system value from — the entity is missing the field the design requires.

Recommendation

Add a Timeout (TimeSpan) field to ExternalSystemDefinition and have InvokeHttpAsync enforce it — either by setting client.Timeout via a typed/named HttpClient registration, or by linking a CancellationTokenSource with the per-system (or DefaultHttpTimeout) timeout to the supplied cancellationToken before SendAsync. Ensure the resulting TaskCanceledException/TimeoutException is classified as transient.

Resolution

Unresolved.

ExternalSystemGateway-003 — CachedCall double-dispatches the HTTP request

Severity High
Category Correctness & logic bugs
Status Open
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:84-117

Description

CachedCallAsync first calls InvokeHttpAsync directly (line 86). On a TransientExternalSystemException it then calls _storeAndForward.EnqueueAsync(...) (line 109). StoreAndForwardService.EnqueueAsync is not a pure enqueue — it "Attempts immediate delivery" by invoking the registered delivery handler (StoreAndForwardService.cs:128-159). If a delivery handler for the ExternalSystem category is registered (as finding 001 recommends), the HTTP request will be executed a second time synchronously inside EnqueueAsync, immediately after the first attempt failed. For a transient failure that is actually a slow/overloaded system, this doubles the load and — critically — if the original request did reach the external system, the immediate retry produces a duplicate delivery before the script even returns, worsening the idempotency hazard the design doc explicitly warns about.

Recommendation

Decide on one dispatch path. Either (a) have CachedCall not pre-invoke InvokeHttpAsync and instead let EnqueueAsync's immediate-delivery attempt be the single first attempt (requires the handler to exist and to surface permanent vs transient correctly); or (b) add an enqueue-only entry point to StoreAndForwardService that skips the immediate-delivery attempt, and have CachedCall use it after its own first attempt. Approach (a) is cleaner and removes the duplicated logic.

Resolution

Unresolved.

ExternalSystemGateway-004 — System retry settings are not honoured for cached calls/writes

Severity Medium
Category Design-document adherence
Status Open
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:114-115, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:86-87

Description

CachedCallAsync and CachedWriteAsync pass the definition's MaxRetries / RetryDelay to EnqueueAsync only when they are non-default (MaxRetries > 0 ? ... : null, RetryDelay > TimeSpan.Zero ? ... : null), otherwise falling back to the S&F defaults. The site-side repository that supplies these definitions, SiteExternalSystemRepository.MapExternalSystem (src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs:194), never reads MaxRetries/RetryDelay from SQLite at all — the constructed entities always have MaxRetries == 0 and RetryDelay == TimeSpan.Zero. As a result, at sites the per-system retry settings the design doc requires are always discarded and the global S&F defaults are silently used instead. The > 0 guard in the ESG also makes a legitimately-configured MaxRetries of 0 ("never retry") indistinguishable from "unset", so an operator cannot express "do not retry".

Recommendation

Within this module, drop the > 0 / > Zero guards and pass the definition values through directly (or use nullable fields on the entity to distinguish "unset"). The companion fix in SiteExternalSystemRepository to actually map the retry columns should be tracked against the SiteRuntime module.

Resolution

Unresolved.

ExternalSystemGateway-005 — HttpRequestMessage and HttpResponseMessage are not disposed

Severity Medium
Category Performance & resource management
Status Open
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:133-167

Description

InvokeHttpAsync creates an HttpRequestMessage (line 133) and receives an HttpResponseMessage from SendAsync (line 155); neither is wrapped in a using nor explicitly disposed. Both are IDisposable and own resources (the request's StringContent, the response's content stream). Under the per-invocation call volume of a busy site this produces avoidable pressure on the finalizer queue and can hold socket/stream resources longer than necessary. The success path reads the content but never disposes the response; the error path likewise reads errorBody and then throws without disposing.

Recommendation

Wrap the request in using var request = ... and the response in using var response = ... (or call Dispose() in a finally). Ensure disposal still occurs on the exception paths.

Resolution

Unresolved.

ExternalSystemGateway-006 — BuildUrl ignores path templates and appends a trailing slash for empty paths

Severity Medium
Category Correctness & logic bugs
Status Open
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:180-196

Description

BuildUrl does baseUrl.TrimEnd('/') + "/" + path.TrimStart('/'). When method.Path is empty (a method that targets the base URL itself), this still appends a /, producing https://host/api/ which some servers treat as a different resource than https://host/api. More importantly, the design doc shows method paths as templates like /recipes/{id}, but BuildUrl performs no placeholder substitution — a {id} token is sent literally in the URL and the corresponding parameter is instead appended as a query-string entry (for GET/DELETE) or placed in the JSON body (POST/PUT). Either the design's path-template feature is unimplemented, or the doc is stale; in the current code a method defined as /recipes/{id} will never produce a correct URL.

Recommendation

Decide whether path templating is in scope. If yes, implement {name} substitution from parameters in BuildUrl and exclude substituted parameters from the query string/body. If no, update the component design doc to remove the /recipes/{id} example and state that paths are literal. Also avoid appending a trailing / when path is empty.

Resolution

Unresolved.

ExternalSystemGateway-007 — External error response bodies are echoed verbatim into script-visible error messages

Severity Medium
Category Security
Status Open
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:167-177

Description

On a non-success HTTP response, the full response body is read into errorBody and embedded verbatim into the exception message ($"HTTP {code} from {name}: {errorBody}"), which then flows into ExternalCallResult.ErrorMessage and back to the calling script, and into Site Event Logging. An external system error page can be arbitrarily large (an HTML stack trace, a multi-megabyte body) and may contain sensitive detail. There is no size cap, so a hostile or misbehaving endpoint can inflate every error log entry and error string returned to scripts. There is also no content-type check before treating the body as text.

Recommendation

Truncate errorBody to a bounded length (e.g. 12 KB) before embedding it, and consider logging the full body separately at debug level rather than returning it to the script. Optionally only include the body when the content type is textual.

Resolution

Unresolved.

ExternalSystemGateway-008 — Cancellation is conflated with transient timeout failure

Severity Medium
Category Error handling & resilience
Status Open
Location src/ScadaLink.ExternalSystemGateway/ErrorClassifier.cs:24-30, src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:157-159

Description

ErrorClassifier.IsTransient(Exception) returns true for TaskCanceledException and OperationCanceledException. HttpClient.SendAsync throws TaskCanceledException both when its internal timeout elapses and when the supplied CancellationToken is cancelled (e.g. the Script Execution Actor is stopped, or the actor system is shutting down). Because InvokeHttpAsync's catch filter treats all of these as transient, a caller-initiated cancellation during a CachedCall will be misclassified as a transient failure and the message will be buffered for retry — work the caller explicitly asked to abandon. For a Call, a shutdown-time cancellation is reported to the script as a "Transient error" rather than an OperationCanceledException.

Recommendation

In InvokeHttpAsync, check cancellationToken.IsCancellationRequested first and rethrow OperationCanceledException (or let it propagate) before applying transient classification. Only treat a cancellation as a timeout when the supplied token is not the one that was cancelled.

Resolution

Unresolved.

ExternalSystemGateway-009 — StoreAndForwardResult from EnqueueAsync is discarded; permanent failures during buffering are swallowed

Severity Medium
Category Correctness & logic bugs
Status Open
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109-117

Description

CachedCallAsync assigns the result of _storeAndForward.EnqueueAsync(...) to sfResult and then never reads it — it unconditionally returns new ExternalCallResult(true, null, null, WasBuffered: true). EnqueueAsync can return Success == false (a permanent failure encountered during its immediate-delivery attempt — StoreAndForwardService.cs:142) or Buffered == false (delivered immediately). In both cases the ESG still reports the call as buffered and successful to the script. A permanent failure surfaced by the S&F immediate attempt is therefore silently lost instead of being returned to the script as the design requires ("On permanent failure (HTTP 4xx), the error is returned synchronously").

Recommendation

Inspect sfResult: if Success == false return an error ExternalCallResult; set WasBuffered from sfResult.Buffered rather than hard-coding true. (This finding is partly subsumed by the dispatch redesign in finding 003.)

Resolution

Unresolved.

ExternalSystemGateway-010 — GetConnectionAsync leaks the SqlConnection when OpenAsync fails

Severity Medium
Category Performance & resource management
Status Open
Location src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:48-50

Description

GetConnectionAsync constructs new SqlConnection(...) and calls await connection.OpenAsync(...). If OpenAsync throws (unreachable server, bad credentials, cancellation) the just-created SqlConnection instance is never disposed — the exception propagates and the local reference is lost. While an unopened SqlConnection is lightweight, over many failing calls this is an avoidable leak. The design doc says Database.Connection() failures return an error to the script; the current code lets a raw SqlException escape, which is acceptable, but the leak is not.

Recommendation

Wrap the open in a try/catch that disposes the connection before rethrowing: try { await connection.OpenAsync(ct); } catch { connection.Dispose(); throw; }.

Resolution

Unresolved.

ExternalSystemGateway-011 — Every call performs a full repository scan of all systems and methods

Severity Low
Category Performance & resource management
Status Open
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:231-245, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:90-97

Description

ResolveSystemAndMethodAsync calls GetAllExternalSystemsAsync() and then GetMethodsByExternalSystemIdAsync() and filters in memory on every single call; ResolveConnectionAsync calls GetAllDatabaseConnectionsAsync() and filters in memory on every cached write / connection request. At sites this hits the SQLite repository, and SiteExternalSystemRepository re-reads and re-parses the methods JSON each time. For a hot script path this is unnecessary repeated I/O and allocation. Definitions only change on deployment, so they are eminently cacheable.

Recommendation

Add an in-memory cache of system/method/connection definitions keyed by name, invalidated on artifact deployment. Alternatively use a name-keyed repository lookup rather than fetch-all-then-filter.

Resolution

Unresolved.

ExternalSystemGateway-012 — Permanent-failure logging requirement is not met; _logger is injected but unused

Severity Low
Category Design-document adherence
Status Open
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:24,169-177, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:22

Description

The design doc states permanent failures are "Logged to Site Event Logging", but InvokeHttpAsync performs no logging on the permanent-failure path. In fact the injected ILogger<ExternalSystemClient> and ILogger<DatabaseGateway> fields are never used at all in either class. Either the logging is expected to happen in the caller (Script Execution Actor) — in which case the design doc is imprecise about where — or it is missing. Separately, IsTransient(HttpStatusCode) treats any non-success, non-(5xx/408/429) status as permanent without an explicit comment, which is a reasonable default but undocumented.

Recommendation

Add a _logger.LogWarning on the permanent-failure path (and a debug log on transient), or clarify in the design doc that Site Event Logging capture is the caller's responsibility and remove the unused _logger fields. Add a comment in ErrorClassifier documenting the "default to permanent" behaviour.

Resolution

Unresolved.

ExternalSystemGateway-013 — MaxConcurrentConnectionsPerSystem and DefaultHttpTimeout options are defined but never used

Severity Low
Category Code organization & conventions
Status Open
Location src/ScadaLink.ExternalSystemGateway/ExternalSystemGatewayOptions.cs:9,12, src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13

Description

ExternalSystemGatewayOptions.MaxConcurrentConnectionsPerSystem (default 10) and DefaultHttpTimeout (default 30s) are bound from configuration but neither is read anywhere. AddHttpClient() registers the default factory with no ConfigurePrimaryHttpMessageHandler/SocketsHttpHandler MaxConnectionsPerServer and no Timeout, so both options have no effect. An operator setting these values gets them silently ignored — a misleading configuration surface (DefaultHttpTimeout is also referenced by finding 002).

Recommendation

Either wire the options into a named/typed HttpClient registration (set MaxConnectionsPerServer on the primary handler, set Timeout), or remove the unused options to avoid implying behaviour that does not exist.

Resolution

Unresolved.

ExternalSystemGateway-014 — Cached-call buffering path and DatabaseGateway are untested

Severity Low
Category Testing coverage
Status Open
Location tests/ScadaLink.ExternalSystemGateway.Tests/ExternalSystemClientTests.cs:1, (no DatabaseGatewayTests.cs)

Description

ExternalSystemClientTests covers system/method not-found, success, transient 500 and permanent 400 for CallAsync, plus CachedCall not-found and success. It does not cover: the CachedCall transient-failure → S&F buffering branch (the most behaviour-rich path, including the _storeAndForward == null fallback and WasBuffered semantics), the CachedCall permanent-failure branch, connection-exception classification (HttpRequestException thrown by the handler), BuildUrl query-string construction, and ApplyAuth for the apikey/basic variants. There is no test file for DatabaseGateway at all — GetConnectionAsync not-found, CachedWriteAsync not-found, and the _storeAndForward == null guard are entirely uncovered. The MockHttpMessageHandler also does not assert request URL/headers/body, so auth and URL construction are unverified.

Recommendation

Add tests for the CachedCall transient/buffering paths (with a substituted S&F service), DatabaseGateway not-found and null-S&F guards, and BuildUrl/ApplyAuth by asserting on the captured HttpRequestMessage in the mock handler.

Resolution

Unresolved.