Final themed batch. 5 well-localised correctness fixes. Serialisation precision: - ESG-020: DatabaseGateway.JsonElementToParameterValue probes TryGetInt64 → TryGetDecimal → GetDouble, so a script's high-precision decimal SQL parameter survives the cached-write retry round-trip without silent precision loss. 3 new regression tests. Template engine correctness: - TE-018: DiffService gains ComputeConnectionsDiff over FlattenedConfiguration.Connections, mirroring the existing entity-diff shape and pairing with the Theme 1 TE-017 hash-coverage fix. A ConfigurationDiff record extension in Commons is flagged as a follow-up. - TE-019: TemplateResolver.BuildInheritanceChain now walks via the int? ParentTemplateId directly — only null means "no parent". A real Id of 0 (the prior special-cased sentinel) now walks the chain like any other node, matching the TemplateEngine-013 CycleDetector fix. Regression of TE-013 closed. - TE-020: All 5 Create* paths in TemplateService + SharedScriptService re-ordered to save-first → log-with-real-Id → save-audit (matching the InstanceService pattern). Create* audit rows no longer carry a literal "0" EntityId. Doc deferral: - Transport-012: Component-Transport.md §Audit Trail now spells out that the BundleImportId repository filter IS wired (in CentralUiRepository), but the Audit-Log-Viewer UI dropdown + summary-row hyperlink are a deferred CentralUI follow-up. CLI workaround documented (audit query --bundle-import-id). 11+ new regression tests (3 ESG, 4 DiffService, 3 TemplateResolver, 4 TemplateService, 1 SharedScriptService). Build clean; ESG 72/72, TemplateEngine 324/324. README regenerated: 1 pending of 481 total. Session-to-date: 135 of 136 originally-open Theme findings closed across 10 themes in 10 commits.
79 KiB
Code Review — ExternalSystemGateway
| Field | Value |
|---|---|
| Module | src/ScadaLink.ExternalSystemGateway |
| Design doc | docs/requirements/Component-ExternalSystemGateway.md |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | 1eb6e97 |
| Open findings | 0 |
Summary
The External System Gateway is a small module (five source files plus options) that
implements the HTTP/REST client (ExternalSystemClient), the database access surface
(DatabaseGateway), and error classification (ErrorClassifier). The structure is
clean and the dual call-mode semantics broadly match the design doc. However, the
review surfaced several substantive problems that prevent the module from behaving as
designed. The most serious is that no store-and-forward delivery handler is ever
registered for the ExternalSystem or CachedDbWrite categories, so cached calls
and cached writes are buffered but can never actually be delivered on retry — a silent
data-loss path. Two further high-impact issues are that the per-system call timeout
is never applied to the HTTP client (the design's central error-handling guarantee
is absent), and that CachedCall double-dispatches the HTTP request because
StoreAndForwardService.EnqueueAsync itself re-attempts immediate delivery, breaking
the idempotency expectations. A cluster of medium issues concern resource leaks,
classification gaps (cancellation conflation), and the dropped StoreAndForwardResult.
Test coverage is thin — CachedCall transient/buffering paths and DatabaseGateway
are entirely untested. Themes: incomplete wiring against the S&F engine, and design-doc
requirements (timeout, retry settings) that are declared but not implemented.
Re-review 2026-05-17 (commit 39d737e)
All fourteen prior findings remain Resolved; the resolutions for findings 001–014
were spot-checked against the current source and hold. The re-review walked the full
10-category checklist again and surfaced three new findings. The most serious
(ExternalSystemGateway-015, High) is a regression introduced by the
ExternalSystemGateway-004 resolution: CachedCall/CachedWrite now pass a
per-system/per-connection MaxRetries of 0 through verbatim, but
StoreAndForwardService.RetryMessageAsync interprets a stored MaxRetries of 0 as
retry forever, not "never retry" — so the very 0 the ESG-004 fix claims to
"honour as never retry" actually produces an unbounded retry loop, and two ESG tests
assert the broken behaviour. ExternalSystemGateway-016 (Medium) is that the
ExternalSystemGateway-013 resolution used ConfigureHttpClientDefaults, which is a
process-global registration — it forces a SocketsHttpHandler (capped at the ESG
option) onto every HttpClient in the host, including the Notification Service's
OAuth2 token client, not just the gateway's per-system clients. ExternalSystemGateway-017
(Low) is a trailing-? URL nit when a GET method's parameters are all null. Theme:
both substantive findings are second-order defects in earlier fixes — the earlier
resolutions did not verify the downstream contract of the S&F engine they integrate
with.
Re-review 2026-05-28 (commit 1eb6e97)
All seventeen prior findings (001–017) remain Resolved; spot-checks against the
current source confirm the fixes still hold. Between 39d737e and this re-review the
only source changes to the module are the documentation-only commit 1eb6e97 (XML
doc additions) and the executionId / sourceScript / parentExecutionId plumbing
threaded through CachedCallAsync / CachedWriteAsync to the S&F enqueue (Audit Log
#23 Tasks 4/6). The re-review walked the full 10-category checklist again and
surfaced six new findings, none Critical. The most serious
(ExternalSystemGateway-018, High) is that DeliverBufferedAsync on both
ExternalSystemClient and DatabaseGateway lets a JsonException from
JsonSerializer.Deserialize propagate out of the delivery handler — the S&F engine
treats any thrown exception as a transient retry, so a corrupted or
schema-incompatible buffered row becomes a permanent poison message that is retried
on every sweep forever (the same retry-forever class of hazard -015 already
addressed for a different cause). ExternalSystemGateway-019 (Medium) is that
HttpClient.Timeout is never set, so any operator-configured DefaultHttpTimeout
greater than 100s is silently clipped by HttpClient's built-in 100s default and the
gateway's "timeout applies to the HTTP request round-trip" guarantee no longer
holds — a partial reopen of the -002 contract for the long-timeout case.
ExternalSystemGateway-020 (Medium) is a silent precision-loss bug in the cached-DB-write
retry path: JsonElementToParameterValue collapses any JSON number that is not
Int64-convertible to double, so a script's decimal SQL parameter is downcast on
retry and only on retry. The remaining three (-021/-022/-023, Low) are an
unauthenticated-by-default ApplyAuth for unknown AuthType / malformed Basic config,
runtime-only HTTP-verb validation, and an undocumented PATCH HTTP method (code vs
design-doc drift). Theme: every new finding is in a code path that was added or
touched by the earlier fix bundle but whose error-propagation contract was not
verified end-to-end against the S&F engine or the design doc.
Checklist coverage
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | ☑ | Prior: URL edge cases, dropped S&F result, classification — 003, 006, 009. Re-review: MaxRetries == 0 retry-forever vs never-retry contradiction (015); trailing-? URL nit (017). |
| 2 | Akka.NET conventions | ☑ | No actors in this module; AddExternalSystemGatewayActors is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found. |
| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; the S&F delivery handlers resolve in a fresh DI scope on the sweep thread. No findings raised. |
| 4 | Error handling & resilience | ☑ | Prior: handler registration, double-dispatch, timeout, cancellation — 001, 002, 003, 008. Re-review: the unbounded-retry consequence of finding 015 is also a resilience defect (recorded under category 1). |
| 5 | Security | ☑ | Error bodies now truncated (007). No new issues — auth secrets not logged, body capped. |
| 6 | Performance & resource management | ☑ | Prior: disposal and repository-scan findings 005, 010, 011 — all resolved and verified. No new issues. |
| 7 | Design-document adherence | ☑ | Prior: timeout, retry settings, logging — 002, 004, 012. Re-review: finding 015 is also a design-adherence gap (S&F retry contract); recorded under category 1. |
| 8 | Code organization & conventions | ☑ | Prior: MaxConcurrentConnectionsPerSystem wiring — 013. Re-review: that wiring uses process-global ConfigureHttpClientDefaults, leaking the ESG cap onto every host HttpClient — finding 016. |
| 9 | Testing coverage | ☑ | Coverage is broad after finding 014. Re-review note: the ZeroMaxRetries... tests assert the persisted column, not the sweep outcome, and so lock in the finding-015 defect. |
| 10 | Documentation & comments | ☑ | Inline comments at ExternalSystemClient.cs:118-119 / DatabaseGateway.cs:99-101 assert a "never retry" semantic that the code does not deliver — see finding 015. |
Re-review (2026-05-28, 1eb6e97):
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | ☑ | JsonException not caught in either DeliverBufferedAsync, so a corrupt buffered payload becomes a permanent poison-message retried forever — finding 018. JsonElementToParameterValue collapses a non-Int64 number to double, silently losing precision for decimal SQL parameters on cached-write retry — finding 020. new HttpMethod(method.HttpMethod) accepts any string at runtime, so an invalid HTTP verb is only diagnosed at call time — finding 022. |
| 2 | Akka.NET conventions | ☑ | Still no actors in this module; AddExternalSystemGatewayActors remains a no-op. The cached-call lifecycle/audit emission lives in ScriptRuntimeContext / CachedCallTelemetryForwarder (SiteRuntime / AuditLog), not here, and that boundary is correct. No issues found. |
| 3 | Concurrency & thread safety | ☑ | Services are still stateless and DI-scoped; the S&F delivery handlers resolve in a fresh DI scope on the sweep thread. The added executionId / sourceScript / parentExecutionId plumbing flows through method arguments only — no shared state introduced. No findings. |
| 4 | Error handling & resilience | ☑ | The poison-payload retry-forever path is the headline resilience issue (finding 018). HttpClient.Timeout not being set leaves the gateway's per-call round-trip cap clipped to the framework's 100s default whenever the configured DefaultHttpTimeout is larger — finding 019 (partial reopen of the -002 contract). |
| 5 | Security | ☑ | Auth secrets still never logged; error bodies still truncated. ApplyAuth is silent on unknown AuthType / empty AuthConfiguration / malformed Basic config — finding 021 (fail-open is a real but bounded risk; recorded Low because misconfiguration is the precondition). Connection-string handling in DatabaseGateway reads from the entity verbatim and never logs it. |
| 6 | Performance & resource management | ☑ | Disposal paths from findings 005/010 still hold. The IHttpClientFactory name-keyed-options registration (finding 016 fix) creates a fresh SocketsHttpHandler per primary-handler build — acceptable because IHttpClientFactory recycles handlers. No new findings. |
| 7 | Design-document adherence | ☑ | The design doc enumerates GET/POST/PUT/DELETE but the code also serializes a body for PATCH (and accepts arbitrary HTTP verbs at runtime) — finding 023 (drift to be reconciled). The per-call timeout guarantee is partially defeated by the unset HttpClient.Timeout for option values > 100s — finding 019. |
| 8 | Code organization & conventions | ☑ | The -016 fix replaced ConfigureHttpClientDefaults with a scoped IConfigureNamedOptions<HttpClientFactoryOptions> — verified clean, no new conventions issue. internal virtual CreateConnection (DatabaseGateway) and internal InvokeHttpAsync (ExternalSystemClient) are exposed via InternalsVisibleTo for tests — acceptable. No new findings. |
| 9 | Testing coverage | ☑ | The JsonException deserialization path for DeliverBufferedAsync is untested; the JsonElementToParameterValue double-downcast path is untested; ApplyAuth's unknown-AuthType / empty-config / malformed-Basic branches are untested. Recorded under findings 018 / 020 / 021 rather than a standalone coverage finding. |
| 10 | Documentation & comments | ☑ | XML doc additions in 1eb6e97 are accurate and consistent. PATCH support is undocumented in the design doc (finding 023). The inline ExternalSystemGateway-015 block-comment in CachedCallAsync (lines 126–133) and the equivalent in DatabaseGateway.cs:106–113 now correctly describe the "treat 0 as unset" semantics. |
Findings
ExternalSystemGateway-001 — No S&F delivery handler registered; cached calls and writes can never be delivered
| Severity | Critical |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:81 |
Description
CachedCallAsync and CachedWriteAsync enqueue messages under
StoreAndForwardCategory.ExternalSystem and StoreAndForwardCategory.CachedDbWrite.
StoreAndForwardService.RegisterDeliveryHandler is the only mechanism that lets the
S&F engine actually deliver a buffered message, and a repository-wide search shows it
is never called for either category anywhere in the codebase. Consequences:
- On a transient failure,
EnqueueAsyncfalls through to the "No handler registered — buffer for later" branch (StoreAndForwardService.cs:163) and the message is persisted. - During the retry sweep,
AttemptDeliveryAsync(StoreAndForwardService.cs:201) logs"No delivery handler for category {Category}"and returns without ever removing or delivering the message.
The result is that every cached external call and cached DB write is silently
buffered forever and never delivered — a data-loss path for the exact "deferred
delivery is acceptable" use cases the design doc calls out (posting production data,
quality reports). The script also receives WasBuffered: true / a successful
CachedWriteAsync completion, so the failure is completely invisible.
Recommendation
Register delivery handlers for StoreAndForwardCategory.ExternalSystem and
StoreAndForwardCategory.CachedDbWrite during host/site startup. The ExternalSystem
handler should deserialize the payload, re-resolve the system/method, and re-invoke
InvokeHttpAsync, returning true/false/throwing per the transient-vs-permanent
contract EnqueueAsync expects. The CachedDbWrite handler should execute the SQL
against the named connection. Add an integration test that buffers a message and
verifies it is delivered by a retry sweep.
Resolution
Resolved 2026-05-16. Delivery handlers for StoreAndForwardCategory.ExternalSystem and
CachedDbWrite are now registered at site startup in AkkaHostedService, after
StoreAndForwardService.StartAsync(). Each handler resolves its consumer in a fresh DI
scope and calls a new DeliverBufferedAsync: ExternalSystemClient.DeliverBufferedAsync
re-resolves the system/method and re-invokes InvokeHttpAsync, and
DatabaseGateway.DeliverBufferedAsync executes the buffered SQL — each returning true
on success, false (park) when the target no longer exists or fails permanently, and
throwing on transient failure so the engine retries. EnqueueAsync gained an
attemptImmediateDelivery parameter; CachedCallAsync passes false so registering the
handler does not dispatch the request twice (the double-dispatch noted in
ExternalSystemGateway-003). Regression tests cover the success, target-removed and
transient-retry paths. Fixed by the commit whose message references
ExternalSystemGateway-001.
ExternalSystemGateway-002 — Per-system call timeout is never applied to HTTP requests
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:130, src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13 |
Description
The design doc states each external system definition specifies a timeout that
"applies to all method calls on that system" and "applies to the HTTP request
round-trip", and ExternalSystemGatewayOptions.DefaultHttpTimeout exists as a
fallback. In practice no timeout is ever configured. ServiceCollectionExtensions
calls services.AddHttpClient() with no per-named-client configuration, and
InvokeHttpAsync calls _httpClientFactory.CreateClient($"ExternalSystem_{system.Name}")
without setting client.Timeout or passing a CancellationToken derived from a
timeout. SendAsync is therefore subject only to HttpClient's default 100-second
timeout, regardless of the system definition or the configured DefaultHttpTimeout.
A slow or hung external system will block the calling Script Execution Actor far
longer than the operator configured, and the design's core error-handling guarantee
(timeout → transient classification) does not hold within the intended window.
There is also no Timeout field on ExternalSystemDefinition at all, so even a
correct implementation has nowhere to read the per-system value from — the entity is
missing the field the design requires.
Recommendation
Add a Timeout (TimeSpan) field to ExternalSystemDefinition and have
InvokeHttpAsync enforce it — either by setting client.Timeout via a typed/named
HttpClient registration, or by linking a CancellationTokenSource with the
per-system (or DefaultHttpTimeout) timeout to the supplied cancellationToken
before SendAsync. Ensure the resulting TaskCanceledException/TimeoutException
is classified as transient.
Resolution
Resolved 2026-05-16 (commit <pending>). InvokeHttpAsync now enforces a call
timeout: ExternalSystemClient takes an IOptions<ExternalSystemGatewayOptions> and
links a CancellationTokenSource(DefaultHttpTimeout) with the caller's token before
SendAsync and the response-body read, so the design's "timeout applies to the HTTP
request round-trip" guarantee now holds within the configured window (default 30s)
instead of HttpClient's default 100s. A timeout is reclassified as a
TransientExternalSystemException; a caller-initiated cancellation is distinguished
from a timeout and propagated as OperationCanceledException rather than being
swallowed as transient. Regression tests:
Call_SlowSystem_TimesOutAsTransientErrorWithinConfiguredWindow and
Call_CallerCancellation_IsNotMisreportedAsTimeout.
Note (partial scope): the per-system Timeout field on ExternalSystemDefinition
remains unimplemented — adding it requires a change to ScadaLink.Commons, which is
outside this module's edit scope. Until that entity field exists, the configured
DefaultHttpTimeout is the effective per-call limit for every system. A follow-up
against the Commons module should add the Timeout field and have InvokeHttpAsync
prefer it over the default. This is a tracked follow-up, not a regression.
ExternalSystemGateway-003 — CachedCall double-dispatches the HTTP request
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:84-117 |
Description
CachedCallAsync first calls InvokeHttpAsync directly (line 86). On a
TransientExternalSystemException it then calls _storeAndForward.EnqueueAsync(...)
(line 109). StoreAndForwardService.EnqueueAsync is not a pure enqueue — it
"Attempts immediate delivery" by invoking the registered delivery handler
(StoreAndForwardService.cs:128-159). If a delivery handler for the ExternalSystem
category is registered (as finding 001 recommends), the HTTP request will be executed
a second time synchronously inside EnqueueAsync, immediately after the first
attempt failed. For a transient failure that is actually a slow/overloaded system,
this doubles the load and — critically — if the original request did reach the
external system, the immediate retry produces a duplicate delivery before the script
even returns, worsening the idempotency hazard the design doc explicitly warns about.
Recommendation
Decide on one dispatch path. Either (a) have CachedCall not pre-invoke
InvokeHttpAsync and instead let EnqueueAsync's immediate-delivery attempt be the
single first attempt (requires the handler to exist and to surface permanent vs
transient correctly); or (b) add an enqueue-only entry point to
StoreAndForwardService that skips the immediate-delivery attempt, and have
CachedCall use it after its own first attempt. Approach (a) is cleaner and removes
the duplicated logic.
Resolution
Resolved 2026-05-16 (commit <pending>). Re-triage: this finding was already fixed in
the codebase as a side effect of the ExternalSystemGateway-001 fix and is no longer
reproducible against the current source. StoreAndForwardService.EnqueueAsync gained an
attemptImmediateDelivery parameter (recommendation approach (b)), and
CachedCallAsync passes attemptImmediateDelivery: false after its own first HTTP
attempt — so EnqueueAsync buffers the message for the background retry sweep without
re-invoking the registered delivery handler, eliminating the duplicate dispatch. A
dedicated regression test, CachedCall_TransientFailure_DoesNotImmediatelyRedispatchViaRegisteredHandler,
was added in this module's test suite: it registers a counting delivery handler, drives
a CachedCall whose HTTP attempt fails transiently, and asserts the handler is invoked
zero times during enqueue. The test was verified to fail if attemptImmediateDelivery
is flipped back to true.
ExternalSystemGateway-004 — System retry settings are not honoured for cached calls/writes
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:114-115, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:86-87 |
Description
CachedCallAsync and CachedWriteAsync pass the definition's MaxRetries /
RetryDelay to EnqueueAsync only when they are non-default
(MaxRetries > 0 ? ... : null, RetryDelay > TimeSpan.Zero ? ... : null), otherwise
falling back to the S&F defaults. The site-side repository that supplies these
definitions, SiteExternalSystemRepository.MapExternalSystem
(src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs:194), never
reads MaxRetries/RetryDelay from SQLite at all — the constructed entities always
have MaxRetries == 0 and RetryDelay == TimeSpan.Zero. As a result, at sites the
per-system retry settings the design doc requires are always discarded and the
global S&F defaults are silently used instead. The > 0 guard in the ESG also makes
a legitimately-configured MaxRetries of 0 ("never retry") indistinguishable from
"unset", so an operator cannot express "do not retry".
Recommendation
Within this module, drop the > 0 / > Zero guards and pass the definition values
through directly (or use nullable fields on the entity to distinguish "unset"). The
companion fix in SiteExternalSystemRepository to actually map the retry columns
should be tracked against the SiteRuntime module.
Resolution
Resolved 2026-05-16 (commit pending). CachedCallAsync and CachedWriteAsync now pass
the definition's MaxRetries to EnqueueAsync verbatim — the > 0 guard is dropped, so
a legitimately-configured MaxRetries of 0 ("never retry") is honoured instead of being
collapsed to the S&F default. The RetryDelay > TimeSpan.Zero guard is deliberately
kept: TimeSpan.Zero is the entity default for an unconfigured field and a literal
zero-delay retry loop is not a valid configuration, so falling back to the S&F default
interval for an unset delay is correct (only MaxRetries == 0 is a meaningful operator
choice). Regression test CachedCall_TransientFailure_ZeroMaxRetriesIsHonouredNotTreatedAsUnset
buffers a transient failure and asserts the buffered message carries MaxRetries == 0
rather than the S&F default; CachedCall_TransientFailure_BuffersWithSystemRetrySettings
additionally covers a non-default settings pass-through. The companion fix in
SiteExternalSystemRepository.MapExternalSystem to actually read the MaxRetries /
RetryDelay columns from SQLite remains a tracked follow-up against the SiteRuntime
module (outside this module's edit scope) — until then, sites still supply
MaxRetries == 0, which this fix now correctly honours as "never retry".
ExternalSystemGateway-005 — HttpRequestMessage and HttpResponseMessage are not disposed
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:133-167 |
Description
InvokeHttpAsync creates an HttpRequestMessage (line 133) and receives an
HttpResponseMessage from SendAsync (line 155); neither is wrapped in a using nor
explicitly disposed. Both are IDisposable and own resources (the request's
StringContent, the response's content stream). Under the per-invocation call volume
of a busy site this produces avoidable pressure on the finalizer queue and can hold
socket/stream resources longer than necessary. The success path reads the content but
never disposes the response; the error path likewise reads errorBody and then throws
without disposing.
Recommendation
Wrap the request in using var request = ... and the response in
using var response = ... (or call Dispose() in a finally). Ensure disposal still
occurs on the exception paths.
Resolution
Resolved 2026-05-16 (commit pending). InvokeHttpAsync now declares the request as
using var request and wraps all response handling in a using (response) block, so
both IDisposable instances (and the request's StringContent / the response content
stream) are released on the success path and on the permanent/transient
exception paths. Regression tests Call_SuccessfulHttp_DisposesRequestAndResponse and
Call_PermanentFailure_StillDisposesRequestAndResponse use a disposal-tracking
HttpMessageHandler/HttpContent and assert both the request and the response content
are disposed; both were verified to fail before the using wrappers were added.
ExternalSystemGateway-006 — BuildUrl ignores path templates and appends a trailing slash for empty paths
| Severity | Medium — partially re-triaged: trailing-slash bug fixed; path-templating sub-issue is a design decision (see Resolution) |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:180-196 |
Description
BuildUrl does baseUrl.TrimEnd('/') + "/" + path.TrimStart('/'). When method.Path
is empty (a method that targets the base URL itself), this still appends a /,
producing https://host/api/ which some servers treat as a different resource than
https://host/api. More importantly, the design doc shows method paths as templates
like /recipes/{id}, but BuildUrl performs no placeholder substitution — a {id}
token is sent literally in the URL and the corresponding parameter is instead appended
as a query-string entry (for GET/DELETE) or placed in the JSON body (POST/PUT). Either
the design's path-template feature is unimplemented, or the doc is stale; in the
current code a method defined as /recipes/{id} will never produce a correct URL.
Recommendation
Decide whether path templating is in scope. If yes, implement {name} substitution
from parameters in BuildUrl and exclude substituted parameters from the query
string/body. If no, update the component design doc to remove the /recipes/{id}
example and state that paths are literal. Also avoid appending a trailing / when
path is empty.
Resolution
Resolved 2026-05-16 (commit pending). The trailing-slash bug is fixed: BuildUrl
now appends a /-joined path segment only when the method's path is non-empty after
trimming, so a method targeting the base URL itself produces https://host/api rather
than https://host/api/. Regression tests Call_MethodWithEmptyPath_DoesNotAppendTrailingSlash
and Call_MethodWithPath_BuildsExpectedUrl (asserting on the captured request URI)
cover the empty-path and normal-path cases; the empty-path test was verified to fail
before the fix.
Re-triage of the path-templating sub-issue ({id} placeholder substitution): this
is a genuine design decision, not a code bug, and it requires editing the component
design doc — both outside this module's edit scope (src/, tests/, this file only).
The current code treats method paths as literal strings and routes parameters to the
query string (GET/DELETE) or JSON body (POST/PUT); a method authored as /recipes/{id}
sends the {id} token verbatim. Tracked follow-up / surfaced design question: the
design owner must decide whether path templating is in scope — if yes, implement
{name} substitution in BuildUrl and exclude substituted params from the
query/body; if no, the Component-ExternalSystemGateway.md /recipes/{id} example must
be changed to a literal path. The trailing-slash defect (the concrete correctness bug
in this finding) is fully resolved.
ExternalSystemGateway-007 — External error response bodies are echoed verbatim into script-visible error messages
| Severity | Medium |
| Category | Security |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:167-177 |
Description
On a non-success HTTP response, the full response body is read into errorBody and
embedded verbatim into the exception message ($"HTTP {code} from {name}: {errorBody}"),
which then flows into ExternalCallResult.ErrorMessage and back to the calling script,
and into Site Event Logging. An external system error page can be arbitrarily large
(an HTML stack trace, a multi-megabyte body) and may contain sensitive detail. There
is no size cap, so a hostile or misbehaving endpoint can inflate every error log entry
and error string returned to scripts. There is also no content-type check before
treating the body as text.
Recommendation
Truncate errorBody to a bounded length (e.g. 1–2 KB) before embedding it, and
consider logging the full body separately at debug level rather than returning it to
the script. Optionally only include the body when the content type is textual.
Resolution
Resolved 2026-05-16 (commit pending). InvokeHttpAsync now truncates the external error
response body to MaxErrorBodyChars (2048) via a Truncate helper before embedding it
into the transient/permanent exception message — so a misbehaving or hostile endpoint
can no longer inflate every script-visible ErrorMessage and Site Event Logging entry
with a multi-megabyte body. When truncation occurs the message is suffixed with
… [truncated, N chars total] so the original size is still visible. Regression test
Call_PermanentFailureWithHugeErrorBody_TruncatesErrorMessage drives a 400 with a
500 KB body and asserts the resulting ErrorMessage is bounded (< 4096 chars); it was
verified to fail (500 040-char message) before the cap was added. Content-type
filtering was considered optional in the recommendation and was not implemented — the
size cap alone closes the inflation/disclosure vector.
ExternalSystemGateway-008 — Cancellation is conflated with transient timeout failure
| Severity | Medium — re-triaged: root cause already fixed in current source (see Resolution) |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ErrorClassifier.cs:24-30, src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:157-159 |
Description
ErrorClassifier.IsTransient(Exception) returns true for TaskCanceledException
and OperationCanceledException. HttpClient.SendAsync throws TaskCanceledException
both when its internal timeout elapses and when the supplied CancellationToken is
cancelled (e.g. the Script Execution Actor is stopped, or the actor system is shutting
down). Because InvokeHttpAsync's catch filter treats all of these as transient, a
caller-initiated cancellation during a CachedCall will be misclassified as a
transient failure and the message will be buffered for retry — work the caller
explicitly asked to abandon. For a Call, a shutdown-time cancellation is reported to
the script as a "Transient error" rather than an OperationCanceledException.
Recommendation
In InvokeHttpAsync, check cancellationToken.IsCancellationRequested first and
rethrow OperationCanceledException (or let it propagate) before applying transient
classification. Only treat a cancellation as a timeout when the supplied token is
not the one that was cancelled.
Resolution
Resolved 2026-05-16 (commit pending). Re-triage: the root cause described — a
caller-initiated cancellation being misclassified as a transient failure — is no
longer present in the current source and is not reproducible. InvokeHttpAsync
already wraps both SendAsync and the response-body ReadAsStringAsync in ordered
catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
filters that rethrow the caller's cancellation before the
catch (Exception ex) when (ErrorClassifier.IsTransient(ex)) branch is ever reached
(this was added alongside the ExternalSystemGateway-002 timeout fix). A caller-cancel
therefore propagates as OperationCanceledException and is never buffered; only the
gateway's own timeout token reclassifies as transient.
ErrorClassifier.IsTransient(Exception) does still return true for
TaskCanceledException/OperationCanceledException, but that is correct and
intentional: a TaskCanceledException raised by an HTTP timeout is a genuine
transient failure, and the only caller (InvokeHttpAsync:238) is unreachable for a
caller-cancellation because the two preceding when-filtered catches intercept it
first. The transient-vs-cancel decision is contextual (which token fired) and cannot
be made from the exception type alone, which is exactly why the call site does it.
No source change was required. A regression guard,
CachedCall_CallerCancellation_IsNotBufferedAsTransient, was added: it cancels the
caller token mid-CachedCall and asserts an OperationCanceledException is thrown and
the S&F buffer remains empty (the cancelled work is not retried). The existing
Call_CallerCancellation_IsNotMisreportedAsTimeout covers the synchronous Call path.
ExternalSystemGateway-009 — StoreAndForwardResult from EnqueueAsync is discarded; permanent failures during buffering are swallowed
| Severity | Medium — re-triaged: root cause subsumed by the ExternalSystemGateway-003 dispatch redesign (see Resolution) |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109-117 |
Description
CachedCallAsync assigns the result of _storeAndForward.EnqueueAsync(...) to
sfResult and then never reads it — it unconditionally returns
new ExternalCallResult(true, null, null, WasBuffered: true). EnqueueAsync can
return Success == false (a permanent failure encountered during its
immediate-delivery attempt — StoreAndForwardService.cs:142) or Buffered == false
(delivered immediately). In both cases the ESG still reports the call as buffered and
successful to the script. A permanent failure surfaced by the S&F immediate attempt is
therefore silently lost instead of being returned to the script as the design requires
("On permanent failure (HTTP 4xx), the error is returned synchronously").
Recommendation
Inspect sfResult: if Success == false return an error ExternalCallResult; set
WasBuffered from sfResult.Buffered rather than hard-coding true. (This finding is
partly subsumed by the dispatch redesign in finding 003.)
Resolution
Resolved 2026-05-16 (commit pending). Re-triage: the stated root cause — "a
permanent failure surfaced by EnqueueAsync's immediate-delivery attempt is silently
lost" — can no longer occur in the current source, and the dead sfResult variable
the finding cites has already been removed. The ExternalSystemGateway-003 fix changed
CachedCallAsync to call EnqueueAsync with attemptImmediateDelivery: false. With
that flag, EnqueueAsync never invokes the registered delivery handler: it skips the
immediate-delivery block entirely (so the StoreAndForwardResult(false, …, …)
permanent-failure return at StoreAndForwardService.cs:147 is unreachable from this
caller) and unconditionally buffers, returning Accepted: true, WasBuffered: true
(StoreAndForwardService.cs:180). The ExternalCallResult(true, null, null, WasBuffered: true) that CachedCallAsync returns is therefore now factually correct
in every reachable case — the message is buffered and there is no swallowed permanent
failure. Permanent HTTP 4xx failures are still surfaced synchronously, because
CachedCallAsync makes its own first HTTP attempt and catches
PermanentExternalSystemException before it ever reaches EnqueueAsync. No source
change was required beyond the ExternalSystemGateway-003 redesign that already landed.
Coverage: CachedCall_TransientFailure_BuffersWithSystemRetrySettings asserts both
result.WasBuffered == true and that the message is genuinely present in the S&F buffer
(depth == 1), confirming the WasBuffered: true claim is not a lie; the existing
CachedCall permanent-failure path is exercised by Call_Permanent400_ReturnsPermanentError
semantics shared via InvokeHttpAsync.
ExternalSystemGateway-010 — GetConnectionAsync leaks the SqlConnection when OpenAsync fails
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:48-50 |
Description
GetConnectionAsync constructs new SqlConnection(...) and calls await connection.OpenAsync(...). If OpenAsync throws (unreachable server, bad
credentials, cancellation) the just-created SqlConnection instance is never disposed
— the exception propagates and the local reference is lost. While an unopened
SqlConnection is lightweight, over many failing calls this is an avoidable leak. The
design doc says Database.Connection() failures return an error to the script; the
current code lets a raw SqlException escape, which is acceptable, but the leak is
not.
Recommendation
Wrap the open in a try/catch that disposes the connection before rethrowing:
try { await connection.OpenAsync(ct); } catch { connection.Dispose(); throw; }.
Resolution
Resolved 2026-05-16 (commit pending). GetConnectionAsync now wraps OpenAsync in a
try/catch that calls await connection.DisposeAsync() before rethrowing, so a failed
open (unreachable server, bad credentials, cancellation) no longer leaks the
SqlConnection. Connection creation was extracted into an internal virtual CreateConnection(string) factory so the failure path is unit-testable. Regression test
GetConnection_OpenFails_DisposesConnectionBeforeRethrowing substitutes a DbConnection
whose OpenAsync always throws and asserts the connection is disposed when the
exception propagates; it was verified to fail before the try/catch was added.
ExternalSystemGateway-011 — Every call performs a full repository scan of all systems and methods
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:360-374, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:169-176 |
Description
ResolveSystemAndMethodAsync calls GetAllExternalSystemsAsync() and then
GetMethodsByExternalSystemIdAsync() and filters in memory on every single call;
ResolveConnectionAsync calls GetAllDatabaseConnectionsAsync() and filters in memory
on every cached write / connection request. At sites this hits the SQLite repository,
and SiteExternalSystemRepository re-reads and re-parses the methods JSON each time.
For a hot script path this is unnecessary repeated I/O and allocation. Definitions only
change on deployment, so they are eminently cacheable.
Recommendation
Add an in-memory cache of system/method/connection definitions keyed by name, invalidated on artifact deployment. Alternatively use a name-keyed repository lookup rather than fetch-all-then-filter.
Resolution
Resolved 2026-05-16 (commit <pending>). A cross-module change was explicitly
authorized, so the name-keyed repository lookup recommendation was applied — the
cleaner of the two options, and one that avoids the staleness hazard a
deployment-invalidated cache would introduce.
Three name-keyed methods were added to IExternalSystemRepository
(ScadaLink.Commons): GetExternalSystemByNameAsync(name),
GetMethodByNameAsync(externalSystemId, methodName) and
GetDatabaseConnectionByNameAsync(name). The connection lookup belongs on the same
interface because database connection definitions are already part of
IExternalSystemRepository (alongside GetAllDatabaseConnectionsAsync /
GetDatabaseConnectionByIdAsync), so the existing repository organization was
followed rather than introducing a new interface.
Both implementers of the interface were updated:
ScadaLink.ConfigurationDatabase.ExternalSystemRepository— all three are genuine server-side keyed queries (FirstOrDefaultAsync(x => x.Name == name), the method lookup additionally scoped byExternalSystemDefinitionId), matching the existingGetMethodByNameAsync/GetListByNameAsync/GetSharedScriptByNameAsyncconvention in the other Central repositories.ScadaLink.SiteRuntime.SiteExternalSystemRepository—GetExternalSystemByNameAsyncandGetDatabaseConnectionByNameAsyncare genuine single-row indexed SQLite queries (WHERE name = @name; both tables havenameas the PRIMARY KEY).GetMethodByNameAsyncresolves the named method from the parent system'smethod_definitionsJSON column; this still requires resolving the parent system (one id→name scan), but the gateway's new call sequence performs one scan instead of the previous two (GetAllExternalSystemsAsync+GetMethodsByExternalSystemIdAsync, which itself scanned), so the site path is strictly better than before — noted as the one place a residual scan remains, bounded by the deployed system count.
ExternalSystemClient.ResolveSystemAndMethodAsync and
DatabaseGateway.ResolveConnectionAsync were rewritten to call the keyed lookups; the
GetAllExternalSystemsAsync / GetMethodsByExternalSystemIdAsync /
GetAllDatabaseConnectionsAsync + in-memory FirstOrDefault filtering is gone from
both hot paths.
Regression tests (TDD — written first, verified failing/not-compiling before the implementation, then confirmed green; one was additionally verified to fail when the keyed query is deliberately broken):
- ConfigurationDatabase (
RepositoryCoverageTests):GetExternalSystemByName_ReturnsMatchingRow,GetExternalSystemByName_MissingName_ReturnsNull,GetMethodByName_ReturnsMethodScopedToParentSystem,GetMethodByName_MissingMethod_ReturnsNull,GetDatabaseConnectionByName_ReturnsMatchingRow,GetDatabaseConnectionByName_MissingName_ReturnsNull. - SiteRuntime (
SiteRepositoryTests):ExternalSystemRepository_GetByName_ReturnsMatchingDefinition,ExternalSystemRepository_GetByName_MissingName_ReturnsNull,ExternalSystemRepository_GetMethodByName_ResolvesScopedToSystem,DatabaseConnectionRepository_GetByName_ReturnsMatchingDefinition. - ExternalSystemGateway: the existing
ExternalSystemClientTests/DatabaseGatewayTestsresolution stubs were migrated to the keyed methods (viaStubResolution/StubConnectionhelpers), so the full gateway suite now exercises and protects the keyed-lookup resolution path.
dotnet build ScadaLink.slnx is clean; ScadaLink.ExternalSystemGateway.Tests (54),
ScadaLink.ConfigurationDatabase.Tests (106), ScadaLink.SiteRuntime.Tests (196) and
ScadaLink.Commons.Tests (226) all pass.
ExternalSystemGateway-012 — Permanent-failure logging requirement is not met; _logger is injected but unused
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:24,169-177, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:22 |
Description
The design doc states permanent failures are "Logged to Site Event Logging", but
InvokeHttpAsync performs no logging on the permanent-failure path. In fact the
injected ILogger<ExternalSystemClient> and ILogger<DatabaseGateway> fields are
never used at all in either class. Either the logging is expected to happen in the
caller (Script Execution Actor) — in which case the design doc is imprecise about
where — or it is missing. Separately, IsTransient(HttpStatusCode) treats any
non-success, non-(5xx/408/429) status as permanent without an explicit comment, which
is a reasonable default but undocumented.
Recommendation
Add a _logger.LogWarning on the permanent-failure path (and a debug log on
transient), or clarify in the design doc that Site Event Logging capture is the
caller's responsibility and remove the unused _logger fields. Add a comment in
ErrorClassifier documenting the "default to permanent" behaviour.
Resolution
Resolved 2026-05-16 (commit pending). Partial re-triage of the finding text: the
"_logger injected but unused" claim is stale — both loggers are already used (the
DeliverBufferedAsync retry-sweep handlers added by ExternalSystemGateway-001 log
park/error events). The genuine remaining gap — InvokeHttpAsync performing no logging
on the HTTP-failure paths — is now fixed: InvokeHttpAsync emits a
_logger.LogWarning on the permanent-failure path (status code, system, method,
truncated error body) so a permanent failure is visible in Site Event Logging as the
design requires, and a _logger.LogDebug on the transient-failure path (transient
failures are normal retry/S&F operation and must not be noisy at warning level). A
documentation comment was added to ErrorClassifier.IsTransient(HttpStatusCode)
explaining the "any other non-success status defaults to permanent" behaviour and why
permanent is the safe default. Regression tests: Call_PermanentFailure_LogsAWarning
(asserts a warning carrying the system name is emitted; verified to fail before the
LogWarning was added) and Call_TransientFailure_DoesNotLogAtWarningOrAbove (guards
against over-logging transient failures).
ExternalSystemGateway-013 — MaxConcurrentConnectionsPerSystem and DefaultHttpTimeout options are defined but never used
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemGatewayOptions.cs:9,12, src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13 |
Description
ExternalSystemGatewayOptions.MaxConcurrentConnectionsPerSystem (default 10) and
DefaultHttpTimeout (default 30s) are bound from configuration but neither is read
anywhere. AddHttpClient() registers the default factory with no
ConfigurePrimaryHttpMessageHandler/SocketsHttpHandler MaxConnectionsPerServer and
no Timeout, so both options have no effect. An operator setting these values gets
them silently ignored — a misleading configuration surface (DefaultHttpTimeout is
also referenced by finding 002).
Recommendation
Either wire the options into a named/typed HttpClient registration (set
MaxConnectionsPerServer on the primary handler, set Timeout), or remove the unused
options to avoid implying behaviour that does not exist.
Resolution
Resolved 2026-05-16 (commit pending). Partial re-triage of the finding text:
DefaultHttpTimeout is no longer unused — it became the effective per-call HTTP
round-trip limit when ExternalSystemGateway-002 was fixed (InvokeHttpAsync builds a
linked CancellationTokenSource(DefaultHttpTimeout)). The genuinely-unused option,
MaxConcurrentConnectionsPerSystem, is now wired in: AddExternalSystemGateway adds a
ConfigureHttpClientDefaults registration that supplies a SocketsHttpHandler whose
MaxConnectionsPerServer is bound from the option, so it applies to every per-system
named client (ExternalSystem_{name}) the gateway creates rather than being silently
ignored. Regression test
ServiceWiringTests.MaxConcurrentConnectionsPerSystem_IsAppliedToTheNamedHttpClientPrimaryHandler
builds the DI container, resolves the named client's handler chain via
IHttpMessageHandlerFactory, walks to the primary handler and asserts
SocketsHttpHandler.MaxConnectionsPerServer equals the configured value; it was verified
to fail before the wiring was added.
ExternalSystemGateway-014 — Cached-call buffering path and DatabaseGateway are untested
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | tests/ScadaLink.ExternalSystemGateway.Tests/ExternalSystemClientTests.cs:1, tests/ScadaLink.ExternalSystemGateway.Tests/DatabaseGatewayTests.cs |
Description
ExternalSystemClientTests covers system/method not-found, success, transient 500 and
permanent 400 for CallAsync, plus CachedCall not-found and success. It does not
cover: the CachedCall transient-failure → S&F buffering branch (the most
behaviour-rich path, including the _storeAndForward == null fallback and WasBuffered
semantics), the CachedCall permanent-failure branch, connection-exception
classification (HttpRequestException thrown by the handler), BuildUrl query-string
construction, and ApplyAuth for the apikey/basic variants. There is no test file
for DatabaseGateway at all — GetConnectionAsync not-found, CachedWriteAsync
not-found, and the _storeAndForward == null guard are entirely uncovered. The
MockHttpMessageHandler also does not assert request URL/headers/body, so auth and
URL construction are unverified.
Recommendation
Add tests for the CachedCall transient/buffering paths (with a substituted S&F
service), DatabaseGateway not-found and null-S&F guards, and BuildUrl/ApplyAuth
by asserting on the captured HttpRequestMessage in the mock handler.
Resolution
Resolved 2026-05-16 (commit pending). Partial re-triage of the finding text: a number
of the listed gaps were already closed by tests added when findings 001–010 were
fixed — DatabaseGatewayTests.cs now exists (not-found, null-S&F guard, buffered-write
delivery), and the CachedCall transient/buffering and permanent paths are covered by
the ExternalSystemGateway-001/003/004/008 regression tests. The remaining genuine
coverage gaps are now closed with new tests:
Call_GetWithParameters_AppendsEscapedQueryString—BuildUrlGET query-string construction, asserting on the captured request URI and that an&inside a value is percent-escaped.Call_PostWithParameters_SendsJsonBody— POST JSON-body construction.Call_ApiKeyAuthWithDefaultHeader_SendsXApiKeyHeader,Call_ApiKeyAuthWithCustomHeader_SendsNamedHeader,Call_BasicAuth_SendsBase64AuthorizationHeader—ApplyAuthfor all three auth variants, asserting on the captured request headers.Call_ConnectionError_IsClassifiedAsTransient— a connection-levelHttpRequestExceptionis classified transient.CachedWrite_BuffersTheWriteWithConnectionRetrySettingsandCachedWrite_ZeroMaxRetriesIsHonouredNotTreatedAsUnset— theDatabaseGatewayCachedWritehappy-path enqueue against a real S&F service.
The shared RequestCapturingHandler test helper was extended to capture request
headers and body so URL/auth/body construction is now verified, not just status codes.
These are new-coverage tests against already-correct behaviour, so they pass on the
current source; the BuildUrl and ApplyAuth paths they exercise are now protected
against regression.
ExternalSystemGateway-015 — MaxRetries == 0 is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:120-127, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:102-108 |
Description
The ExternalSystemGateway-004 fix removed the MaxRetries > 0 ? ... : null guard so
that CachedCallAsync and CachedWriteAsync now pass the definition's MaxRetries
to StoreAndForwardService.EnqueueAsync verbatim. The stated rationale (and the inline
comments at ExternalSystemClient.cs:118-119 / DatabaseGateway.cs:99-101, plus the
tests CachedCall_TransientFailure_ZeroMaxRetriesIsHonouredNotTreatedAsUnset and
CachedWrite_ZeroMaxRetriesIsHonouredNotTreatedAsUnset) is that a configured
MaxRetries of 0 means "never retry".
That is the opposite of what the Store-and-Forward engine actually does with the
value. EnqueueAsync stores the passed maxRetries directly into
StoreAndForwardMessage.MaxRetries (StoreAndForwardService.cs:139), whose own XML
doc states "Maximum retry-sweep attempts before parking (0 = no limit)"
(StoreAndForwardMessage.cs:30). The retry sweep enforces it as
if (message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries)
(StoreAndForwardService.cs:285) — when MaxRetries == 0 that guard is false on
every sweep, so the message is never parked and is retried forever.
Consequences for a CachedCall/CachedWrite against a system or connection
configured with MaxRetries = 0:
- A transiently-failing message that the operator intended never to retry is instead retried on every sweep indefinitely, accumulating in the buffer and repeatedly re-dispatching the request — the exact unbounded-retry / duplicate-delivery hazard the idempotency note in the design doc warns about.
- The two ESG regression tests cited above assert
max_retries == 0is stored and describe that as "honoured" — they verify the persisted column, never the resulting sweep behaviour, so they lock in the broken semantics. - Because the SiteRuntime repository still always supplies
MaxRetries == 0(the open companion gap noted inExternalSystemGateway-004), every cached call and cached write at every site currently buffers as retry-forever. Before the ESG-004 fix the> 0guard sentnull, so the S&FDefaultMaxRetries(a bounded value) applied — i.e. the ESG-004 fix turned a bounded retry into an unbounded one for the common path.
Recommendation
Reconcile the ESG and S&F interpretations of MaxRetries == 0 — they must agree.
Either: (a) treat the entity's MaxRetries == 0 as "unset" and pass null so the
bounded S&F default applies (reverting to the pre-ESG-004 behaviour, and accepting
that "never retry" then needs a different representation such as a nullable field or a
MaxRetries == 1 convention); or (b) if 0 genuinely must mean "never retry", add an
explicit no-retry path — e.g. do not enqueue at all on transient failure when
MaxRetries == 0, or introduce a distinct sentinel — and fix the
StoreAndForwardMessage.MaxRetries doc and RetryMessageAsync guard so 0 no longer
means "no limit". Update the two ZeroMaxRetries... tests to assert the sweep
outcome (parked / not retried), not just the stored column value.
Resolution
Resolved 2026-05-17. Root cause confirmed: the S&F engine treats a stored
MaxRetries == 0 as "no limit / retry forever" (StoreAndForwardMessage.MaxRetries
doc "0 = no limit"; sweep guard MaxRetries > 0 && RetryCount >= MaxRetries), while
the entity's non-nullable int MaxRetries defaults to 0 — so passing it verbatim
buffered every cached call/write as an unbounded retry loop. Fix (ESG-side only,
recommendation (a)): CachedCallAsync and CachedWriteAsync now pass
MaxRetries > 0 ? MaxRetries : null, so an entity 0 is treated as "unset" and the
bounded S&F DefaultMaxRetries applies; the misleading "0 = never retry" inline
comments were corrected. The two ZeroMaxRetries... tests were rewritten to
CachedCall_TransientFailure_ZeroMaxRetriesIsTreatedAsUnsetNotRetryForever /
CachedWrite_ZeroMaxRetriesIsTreatedAsUnsetNotRetryForever, asserting the buffered
message carries the bounded default (99) and never 0.
ExternalSystemGateway-016 — ConfigureHttpClientDefaults applies the ESG connection cap to every HttpClient in the host process
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:21-29 |
Description
The ExternalSystemGateway-013 fix wires MaxConcurrentConnectionsPerSystem by
calling services.ConfigureHttpClientDefaults(builder => builder.ConfigurePrimaryHttpMessageHandler(...)).
The inline comment claims this "applies to the dynamically-named clients created by
ExternalSystemClient" — but ConfigureHttpClientDefaults is process-global: it
adds the configuration to every HttpClient/IHttpClientFactory client created
anywhere in the host, regardless of name.
The Host registers the External System Gateway alongside other components that also
use IHttpClientFactory — notably ScadaLink.NotificationService (OAuth2TokenService
and its ServiceCollectionExtensions call AddHttpClient). With the ESG registration
present, the OAuth2 token client (and any future HttpClient consumer in the host)
has its primary handler replaced by a SocketsHttpHandler whose
MaxConnectionsPerServer is the ESG's MaxConcurrentConnectionsPerSystem. That:
- Silently caps unrelated clients at a value owned by, and named for, a different component — an operator tuning the ESG option unknowingly throttles Microsoft 365 OAuth2 token acquisition.
- Overrides/discards any primary-handler configuration those other components add for their own clients (e.g. a custom handler, proxy, or certificate settings).
This is a leaky, surprising side effect for what the option claims to be a per-ESG
setting; ConfigureHttpClientDefaults should not be used to express a single
component's policy.
Recommendation
Scope the handler configuration to the gateway's own clients only. The ESG already
creates per-system clients with a deterministic name pattern
(ExternalSystem_{system.Name}); register a typed/named HttpClient (or a small
factory abstraction) for that pattern and call ConfigurePrimaryHttpMessageHandler
on that registration instead of on the global defaults. If a name-pattern registration
is impractical, document the global effect explicitly and rename the option, but the
preferred fix is to stop using ConfigureHttpClientDefaults.
Resolution
Resolved 2026-05-17. Root cause confirmed: ConfigureHttpClientDefaults is
process-global and replaced the primary handler of every IHttpClientFactory client
in the host, leaking the ESG connection cap onto unrelated clients. Fix: the global
ConfigureHttpClientDefaults registration was replaced with an
IConfigureNamedOptions<HttpClientFactoryOptions> (GatewayHttpClientConfigurator)
that applies the SocketsHttpHandler/MaxConnectionsPerServer cap only to clients
whose name starts with ExternalSystem_ (the gateway's own per-system clients), so
clients owned by other components keep their own (or the framework default) primary
handler. Regression test
ServiceWiringTests.MaxConcurrentConnectionsPerSystem_IsNotAppliedToNonGatewayHttpClients
asserts a non-gateway client does not inherit the cap while the gateway client still
does; it was verified to fail before the fix.
ExternalSystemGateway-017 — BuildUrl appends a bare trailing ? when a GET method's parameters are all null
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:324-333 |
Description
In BuildUrl, the GET/DELETE query-string branch is entered when
parameters != null && parameters.Count > 0, but the projection then filters out
null-valued entries (parameters.Where(p => p.Value != null)). When a GET/DELETE
method is invoked with a non-empty parameter dictionary whose values are all null,
queryString is the empty string and the code still executes url += "?" + queryString,
producing a URL ending in a bare ? (e.g. https://host/api/resource?). Most servers
tolerate a trailing ?, but it is an unintended artifact, can defeat response caching
on some endpoints, and makes captured request URLs harder to read in logs.
Recommendation
Only append "?" + queryString when queryString is non-empty (compute the joined
string first and check it), so a method whose effective parameter set is empty
produces a clean URL identical to the no-parameters case.
Resolution
Resolved 2026-05-17. Root cause confirmed: BuildUrl appended "?" + queryString
whenever the GET/DELETE parameter dictionary was non-empty, even when every value
was null and queryString was the empty string, yielding a bare trailing ?. Fix:
BuildUrl now appends "?" + queryString only when queryString.Length > 0, so a
method whose effective parameter set is empty produces a URL identical to the
no-parameters case. Regression test
Call_GetWithAllNullParameters_DoesNotAppendTrailingQuestionMark asserts the
captured request URI has no trailing ?; it was verified to fail before the fix.
ExternalSystemGateway-018 — DeliverBufferedAsync lets JsonException propagate, turning a corrupt buffered row into a permanent retry-forever poison message
| Severity | High |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:176, src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:151 |
Resolution — Wrapped the JsonSerializer.Deserialize<...>(message.PayloadJson) call in both ExternalSystemClient.DeliverBufferedAsync and DatabaseGateway.DeliverBufferedAsync in a try/catch (JsonException) block. A JsonException is by definition permanent (the same payload bytes always deserialize identically), so the catch branch logs at LogError and returns false, parking the message via the S&F engine instead of letting it throw and be retried as a transient failure. Regression tests DeliverBuffered_MalformedJsonPayload_ReturnsFalseSoMessageParks were added to both ExternalSystemClientTests and DatabaseGatewayTests — each feeds a truncated PayloadJson to the handler and asserts delivered == false and that no exception escapes.
Description
Both ExternalSystemClient.DeliverBufferedAsync and DatabaseGateway.DeliverBufferedAsync
begin with an unguarded JsonSerializer.Deserialize<...>(message.PayloadJson):
var payload = JsonSerializer.Deserialize<CachedCallPayload>(message.PayloadJson);
if (payload == null || string.IsNullOrEmpty(payload.SystemName) || ...) {
_logger.LogError("... unreadable payload; parking.");
return false;
}
The "unreadable payload; parking" branch is only entered when Deserialize succeeds
and produces a null / partially-empty object. If PayloadJson is malformed JSON —
the column was truncated mid-write, an older payload schema is being deserialized into a
newer record, or storage corruption occurred — Deserialize throws JsonException
before that check is ever reached. The exception propagates out of the delivery handler.
The Store-and-Forward retry loop treats any thrown exception from a delivery handler
as a transient failure (only a returned false parks the message); see
StoreAndForwardService.RetryMessageAsync. Combined with the MaxRetries == 0 →
"unset → bounded default" fix from -015, the resulting behaviour is:
- Corrupt payload arrives in the buffer.
- Every retry sweep deserializes, throws
JsonException, incrementsRetryCount. - The message is retried until
RetryCount >= MaxRetries, then parked — only ifMaxRetries > 0is configured (which-015already established is not the default site configuration today). With the bounded S&F default it does eventually park, but it park-loops noisily forDefaultMaxRetriesiterations first; without that bound it retries forever. - The script is unaware — the cached call was returned
WasBuffered: truelong ago.
This is the same "poison message buffered forever" class of hazard that
ExternalSystemGateway-001 (no-handler) and ExternalSystemGateway-015 (MaxRetries==0)
already removed for their own causes; corrupt JSON is an alternative arrival path into
the same bad state.
The DatabaseGateway.DeliverBufferedAsync path has the same shape and the same defect:
JsonSerializer.Deserialize<CachedWritePayload> at line 151 is not guarded.
Recommendation
Wrap the Deserialize call in a try/catch (JsonException) block in both
DeliverBufferedAsync methods. A JsonException is by definition a permanent failure —
re-running the same deserialization against the same payload will produce the same
exception — so the catch should log at LogError and return false so the S&F
engine parks the message rather than retrying. Add regression tests that feed a
malformed PayloadJson to each handler and assert delivered == false (i.e. the
message parks) and that no exception escapes the handler.
ExternalSystemGateway-019 — HttpClient.Timeout is not set; DefaultHttpTimeout > 100s is silently clipped by the framework default
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:226,257-264, src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:90-102 |
Resolution (2026-05-28): Set client.Timeout = Timeout.InfiniteTimeSpan immediately after _httpClientFactory.CreateClient($"ExternalSystem_{system.Name}") in ExternalSystemClient.InvokeHttpAsync, disabling the framework's 100 s default so the per-call CancellationTokenSource(_options.DefaultHttpTimeout) linked CTS already built below is the sole timeout source. An operator-configured DefaultHttpTimeout greater than 100 s is now honoured verbatim instead of being silently clipped and misclassified as a transient "connection error". Kept the fix local to the allowed file (ExternalSystemClient.cs) rather than touching ServiceCollectionExtensions.cs/GatewayHttpClientConfigurator. Regression test Call_DisablesHttpClientFrameworkTimeoutSoLongTimeoutsArentClipped asserts the rented client starts with the framework's 100 s default and is set to Timeout.InfiniteTimeSpan after InvokeHttpAsync runs.
Description
The -002 fix enforces the per-call timeout via a linked CancellationTokenSource
built from _options.DefaultHttpTimeout and passed into SendAsync. That correctly
caps every call to at most the configured value when DefaultHttpTimeout ≤ 100s.
However, HttpClient.Timeout (the framework default) is never set on either the named
client or its primary handler — the GatewayHttpClientConfigurator only sets
MaxConnectionsPerServer. HttpClient.Timeout defaults to 100 seconds, and
SendAsync enforces it internally by cancelling its own private CTS, raising a
TaskCanceledException from SendAsync without cancelling either the caller's token
or the gateway's timeoutCts.
Consequences when an operator configures DefaultHttpTimeout to anything > 100s
(a legitimate setting for external systems with long-running endpoints — recipe
exports, large queries):
- The gateway's
timeoutCts(e.g. 5 minutes) has not yet fired. HttpClient.Timeoutfires at 100s,SendAsyncthrows.- Neither
when (cancellationToken.IsCancellationRequested)norwhen (timeoutCts.IsCancellationRequested)matches, so the exception falls into the genericcatch (Exception ex) when (ErrorClassifier.IsTransient(ex))branch (line 277) and is re-thrown as aTransientExternalSystemExceptionwith the message"Connection error to {Name}: A task was canceled."— misattributing a timeout as a connection error. - The configured 5-minute round-trip window the design doc promises ("Each external system definition specifies a timeout that applies to all method calls on that system" / "applies to the HTTP request round-trip") is silently overridden.
The opposite case (DefaultHttpTimeout < 100s) is the only one the -002 regression
test exercises (200ms), so the defect is not caught by the existing suite.
Recommendation
Set HttpClient.Timeout = Timeout.InfiniteTimeSpan on the gateway's named clients via
the existing GatewayHttpClientConfigurator (delegate HttpClientActions rather than
just HttpMessageHandlerBuilderActions), so the cancellation-token mechanism is the
sole timeout source. The linked timeoutCts then reliably enforces
DefaultHttpTimeout for every value, and the timeout-vs-cancellation classification at
lines 266–276 stays accurate. Add a regression test that configures DefaultHttpTimeout
to ~150s, hangs the handler, and asserts the call times out at the configured value
and produces a "Timeout calling..." (not "Connection error to...") error.
ExternalSystemGateway-020 — JsonElementToParameterValue silently downcasts non-Int64 JSON numbers to double, losing precision for decimal SQL parameters on retry
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:185-193 |
Resolution (2026-05-28): JsonElementToParameterValue now probes TryGetInt64 → TryGetDecimal → GetDouble, so a JSON number that fits in decimal materialises as a decimal (preserving the script's authored precision on cached-write retries) and only genuinely out-of-decimal-range values fall through to double. Regression test JsonElementToParameterValue_DecimalShapedNumber_PreservesPrecisionViaDecimal round-trips 1234567890.1234567890 through a JsonElement and asserts the result is a decimal carrying the original precision; companion tests guard the long-fast-path and the out-of-range-double fallback.
Description
DatabaseGateway.JsonElementToParameterValue materialises the buffered cached-write
SQL parameter values during a retry-sweep delivery:
private static object JsonElementToParameterValue(JsonElement element) => element.ValueKind switch
{
JsonValueKind.String => (object?)element.GetString() ?? DBNull.Value,
JsonValueKind.Number => element.TryGetInt64(out var l) ? l : element.GetDouble(),
...
};
For a JSON number, the helper attempts Int64 first and otherwise returns a double.
There is no decimal branch. The immediate-attempt path is unaffected — CachedWriteAsync
on the original call serializes the script-provided typed parameters via
JsonSerializer.Serialize(new { ConnectionName, Sql, Parameters = parameters }) and
executes the SQL directly outside this code path. But the retry path runs through
DeliverBufferedAsync → JsonElementToParameterValue, so a script that submitted
a decimal value (e.g. 123.4567890123m) gets:
- Immediate attempt:
decimalparameter, full precision (or, more accurately, the value never enters this helper because cached writes today never re-execute on the immediate path — but on the retry path it does). - Retry attempt(s) after a transient failure: the value is deserialized as a JSON
number, fails
TryGetInt64, and is downcast todouble— which has ~15–17 digits of precision againstdecimal's 28–29. A SQL column of typedecimal(18, 6)ornumericreceives a value that has been truncated todoubleprecision before parameter binding.
Two further consequences worth recording:
- The downcast is silent — there is no log, no error, and the cached-write acknowledgement to the script has long since happened. Data drift between a same-call immediate-success delivery and a same-call retry delivery is the worst shape of "looks like the right value but isn't" defect.
- For SCADA telemetry (process variables, totals, currency-denominated quality
reports)
decimalis the correct CLR type anddouble's representation error changes the persisted value.
Recommendation
Replace the Number branch with a precision-preserving cascade — try Int64, then
decimal (element.TryGetDecimal(out var d) ? d : element.GetDouble()), and only
fall back to double when even decimal fails. Add a regression test against
DatabaseGateway.DeliverBufferedAsync that buffers a write with a high-precision
decimal parameter, drives the delivery, and asserts the SQL parameter bound is a
decimal (or compares the round-tripped value to the original at the parameter level)
rather than a double with truncated precision. The same Number-branch decision should
be reviewed against JsonValueKind.True/False/Null (currently fine) and a string
that happens to encode a number (already correctly returns string).
ExternalSystemGateway-021 — ApplyAuth silently sends an unauthenticated request on unknown AuthType, empty AuthConfiguration, or malformed Basic config
| Severity | Low |
| Category | Security |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:385-415 |
Resolution (2026-05-28): ApplyAuth is now an instance method that uses
the existing _logger. Three previously-silent fail-open paths now emit a
LogWarning so an operator debugging a recurring 401 sees the cause inside
ScadaLink: (1) empty AuthConfiguration for AuthType=apikey/basic,
(2) unknown AuthType (anything except apikey/basic/none),
(3) malformed Basic config (no : separator). The AuthConfiguration
value is NEVER included in the log message. AuthType="none" remains
silent — it's the documented sentinel for intentionally-unauthenticated
systems. Behaviour is otherwise unchanged: the request still goes out
(never block on auth-config issues), the failure mode is just now visible.
Description
ApplyAuth has three fail-open paths that all result in an HTTP request being sent
without the credential the operator configured:
- Line 387 —
if (string.IsNullOrEmpty(system.AuthConfiguration)) return;returns early regardless ofAuthType. A system entity withAuthType = "apikey"but an emptyAuthConfiguration(e.g. the secret column failed to deploy, or the protector key changed and decryption produced"") sends every request with noX-API-Keyheader — the gateway is silent. - The
switchhas nodefaultarm. A system entity withAuthType = "bearer","oauth2", a typo like"ApiKey "(trailing space) or even"none"falls off theswitchand the request is sent without any auth header — again silent. - Line 408 —
if (basicParts.Length == 2)skips the auth attach whenAuthConfigurationforbasiclacks a:separator. The request is sent with noAuthorizationheader.
Effectively the gateway treats every misconfiguration as "send anonymously" and
relies on the remote system rejecting it with a 401/403. That is a defensible default
on its own, but combined with -007's 2 KB error-body cap and the fact that no audit
or warning is emitted, an operator debugging "why does my external system always
return 401" has nothing to go on inside ScadaLink — the gateway never says it failed
to apply auth. For AuthType = "none" (the design's expected sentinel for
unauthenticated systems) the fall-through is correct; the failure mode is misconfig.
Recommendation
Add a default: arm to the switch that logs _logger.LogWarning(...) naming the
unknown AuthType and the system, and emit a similar warning when
AuthConfiguration is empty for an AuthType of "apikey" or "basic" (those
require a value; "none" does not). For Basic auth specifically, the
basicParts.Length != 2 branch should also warn. Do not include the
AuthConfiguration value in the log message — secrets must stay out of the log
(consistent with the existing module). A small set of ApplyAuth unit tests
verifying the warning emission and that no Authorization / X-API-Key header is
ever leaked in the warning text would close the test gap as well.
ExternalSystemGateway-022 — new HttpMethod(method.HttpMethod) accepts any string at runtime; an invalid HTTP verb fails only at call time
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:233 |
Resolution (2026-05-28): Added a ValidateHttpMethod helper called at the top of InvokeHttpAsync that rejects any verb outside the documented GET/POST/PUT/PATCH/DELETE allowlist (matching ESG-023's design-doc reconciliation) with a clear ArgumentException naming the offending verb. Allowlist is a HashSet<string> with OrdinalIgnoreCase so the operator-authored entity column is case-insensitive. Regression tests Call_UnsupportedHttpMethod_ThrowsArgumentException (Theory: FOO/DLETE/GIT/OPTIONS/HEAD) and Call_DocumentedHttpMethod_IsAccepted (Theory: GET/get/Post/PATCH/delete) cover the rejection and the case-insensitive accept paths.
Description
InvokeHttpAsync constructs the request method directly from the string column:
new HttpRequestMessage(new HttpMethod(method.HttpMethod), url). System.Net.Http.HttpMethod(string)
performs only a token-character validation (it rejects whitespace and control chars
but accepts arbitrary non-standard tokens like "FOO" or "GIT"). The body-vs-query
selection at lines 239–250 explicitly checks for POST/PUT/PATCH; for any other
non-standard verb ("FOO") the parameters silently go to neither body nor query
string and the request is dispatched anyway.
The design doc enumerates GET/POST/PUT/DELETE as the supported set. There is no
validation at deployment time, at definition save time, or at gateway
resolution time that method.HttpMethod is one of the expected verbs. An operator
who typos "DLETE" discovers the issue only when a script invokes that method and
the remote server rejects the request — usually as a 4xx that the gateway classifies
as permanent, which is correct but obscures the root cause.
Recommendation
Validate method.HttpMethod at gateway entry — either with a small switch of
allowed verbs in InvokeHttpAsync that throws PermanentExternalSystemException for
an unsupported verb (cheap, immediate, surfaces a clear error to the script), or by
adding a validation pass in the Template/Deployment Manager so it can never reach
the gateway. The first option is local to this module and cheaper to land. Either
way, the canonical list should agree with BuildUrl's query-vs-body decision (which
currently knows about POST/PUT/PATCH for body and GET/DELETE for query — note PATCH
is in the body branch but not the design-doc list; see finding 023).
ExternalSystemGateway-023 — PATCH HTTP method is supported by code but absent from the design doc; body-vs-query decision drifts from the documented set
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:241, docs/requirements/Component-ExternalSystemGateway.md:43 |
Resolution (2026-05-28): Doc-only fix; confirmed PATCH is wired in ExternalSystemClient.cs:258-260 alongside POST/PUT for body serialization. Added PATCH to the design doc's HTTP-method list (line 42) and updated the body/query-parameter sentence (line 75) so the documented set matches the code's body = POST/PUT/PATCH; query = GET/DELETE split.
Description
The component design doc lists the supported HTTP methods as GET, POST, PUT, or DELETE (line 43: **HTTP method**: GET, POST, PUT, or DELETE.). InvokeHttpAsync's
body-serialization branch at lines 239–250 explicitly includes PATCH alongside POST
and PUT — so PATCH is in fact supported (and routes parameters into the JSON body),
but operators reading the spec would not know it. Conversely, BuildUrl's
query-string branch at lines 364–366 lists only GET and DELETE, so a PATCH
method's parameters always go to the body, matching the body-branch but not appearing
anywhere in the documented contract.
This is mild drift — the code is more permissive than the spec. It only becomes a real issue if a future change relies on the documented "only GET/POST/PUT/DELETE" set and breaks the PATCH path silently, or if PATCH is genuinely out of scope and a template author defines a PATCH method on purpose only to learn later it is unsupported.
Recommendation
Pick one direction and apply it in the same session, per the project's "design doc + code travel together" rule:
- If PATCH is intentionally supported, add
PATCHto the Component-ExternalSystemGateway.md HTTP-method list (line 43) and add a parameterised test confirming a PATCH method sends its parameters in the JSON body and resolves like POST/PUT for error classification. - If PATCH is not in scope, remove
method.HttpMethod.Equals("PATCH", ...)from the body branch inInvokeHttpAsyncand let finding-022's verb validation reject it. The design-doc list then remains the single source of truth.