11950b0a8e
Final themed batch. 5 well-localised correctness fixes. Serialisation precision: - ESG-020: DatabaseGateway.JsonElementToParameterValue probes TryGetInt64 → TryGetDecimal → GetDouble, so a script's high-precision decimal SQL parameter survives the cached-write retry round-trip without silent precision loss. 3 new regression tests. Template engine correctness: - TE-018: DiffService gains ComputeConnectionsDiff over FlattenedConfiguration.Connections, mirroring the existing entity-diff shape and pairing with the Theme 1 TE-017 hash-coverage fix. A ConfigurationDiff record extension in Commons is flagged as a follow-up. - TE-019: TemplateResolver.BuildInheritanceChain now walks via the int? ParentTemplateId directly — only null means "no parent". A real Id of 0 (the prior special-cased sentinel) now walks the chain like any other node, matching the TemplateEngine-013 CycleDetector fix. Regression of TE-013 closed. - TE-020: All 5 Create* paths in TemplateService + SharedScriptService re-ordered to save-first → log-with-real-Id → save-audit (matching the InstanceService pattern). Create* audit rows no longer carry a literal "0" EntityId. Doc deferral: - Transport-012: Component-Transport.md §Audit Trail now spells out that the BundleImportId repository filter IS wired (in CentralUiRepository), but the Audit-Log-Viewer UI dropdown + summary-row hyperlink are a deferred CentralUI follow-up. CLI workaround documented (audit query --bundle-import-id). 11+ new regression tests (3 ESG, 4 DiffService, 3 TemplateResolver, 4 TemplateService, 1 SharedScriptService). Build clean; ESG 72/72, TemplateEngine 324/324. README regenerated: 1 pending of 481 total. Session-to-date: 135 of 136 originally-open Theme findings closed across 10 themes in 10 commits.
1315 lines
79 KiB
Markdown
1315 lines
79 KiB
Markdown
# Code Review — ExternalSystemGateway
|
||
|
||
| Field | Value |
|
||
|-------|-------|
|
||
| Module | `src/ScadaLink.ExternalSystemGateway` |
|
||
| Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
|
||
| Status | Reviewed |
|
||
| Last reviewed | 2026-05-28 |
|
||
| Reviewer | claude-agent |
|
||
| Commit reviewed | `1eb6e97` |
|
||
| Open findings | 0 |
|
||
|
||
## Summary
|
||
|
||
The External System Gateway is a small module (five source files plus options) that
|
||
implements the HTTP/REST client (`ExternalSystemClient`), the database access surface
|
||
(`DatabaseGateway`), and error classification (`ErrorClassifier`). The structure is
|
||
clean and the dual call-mode semantics broadly match the design doc. However, the
|
||
review surfaced several substantive problems that prevent the module from behaving as
|
||
designed. The most serious is that **no store-and-forward delivery handler is ever
|
||
registered** for the `ExternalSystem` or `CachedDbWrite` categories, so cached calls
|
||
and cached writes are buffered but can never actually be delivered on retry — a silent
|
||
data-loss path. Two further high-impact issues are that the **per-system call timeout
|
||
is never applied** to the HTTP client (the design's central error-handling guarantee
|
||
is absent), and that **`CachedCall` double-dispatches the HTTP request** because
|
||
`StoreAndForwardService.EnqueueAsync` itself re-attempts immediate delivery, breaking
|
||
the idempotency expectations. A cluster of medium issues concern resource leaks,
|
||
classification gaps (cancellation conflation), and the dropped `StoreAndForwardResult`.
|
||
Test coverage is thin — `CachedCall` transient/buffering paths and `DatabaseGateway`
|
||
are entirely untested. Themes: incomplete wiring against the S&F engine, and design-doc
|
||
requirements (timeout, retry settings) that are declared but not implemented.
|
||
|
||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||
|
||
All fourteen prior findings remain `Resolved`; the resolutions for findings 001–014
|
||
were spot-checked against the current source and hold. The re-review walked the full
|
||
10-category checklist again and surfaced **three new findings**. The most serious
|
||
(`ExternalSystemGateway-015`, High) is a regression *introduced by* the
|
||
`ExternalSystemGateway-004` resolution: `CachedCall`/`CachedWrite` now pass a
|
||
per-system/per-connection `MaxRetries` of `0` through verbatim, but
|
||
`StoreAndForwardService.RetryMessageAsync` interprets a stored `MaxRetries` of `0` as
|
||
**retry forever**, not "never retry" — so the very `0` the ESG-004 fix claims to
|
||
"honour as never retry" actually produces an unbounded retry loop, and two ESG tests
|
||
assert the broken behaviour. `ExternalSystemGateway-016` (Medium) is that the
|
||
`ExternalSystemGateway-013` resolution used `ConfigureHttpClientDefaults`, which is a
|
||
**process-global** registration — it forces a `SocketsHttpHandler` (capped at the ESG
|
||
option) onto every `HttpClient` in the host, including the Notification Service's
|
||
OAuth2 token client, not just the gateway's per-system clients. `ExternalSystemGateway-017`
|
||
(Low) is a trailing-`?` URL nit when a GET method's parameters are all null. Theme:
|
||
both substantive findings are second-order defects in earlier fixes — the earlier
|
||
resolutions did not verify the downstream contract of the S&F engine they integrate
|
||
with.
|
||
|
||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||
|
||
All seventeen prior findings (001–017) remain `Resolved`; spot-checks against the
|
||
current source confirm the fixes still hold. Between `39d737e` and this re-review the
|
||
only source changes to the module are the documentation-only commit `1eb6e97` (XML
|
||
doc additions) and the `executionId` / `sourceScript` / `parentExecutionId` plumbing
|
||
threaded through `CachedCallAsync` / `CachedWriteAsync` to the S&F enqueue (Audit Log
|
||
#23 Tasks 4/6). The re-review walked the full 10-category checklist again and
|
||
surfaced **six new findings**, none Critical. The most serious
|
||
(`ExternalSystemGateway-018`, High) is that `DeliverBufferedAsync` on both
|
||
`ExternalSystemClient` and `DatabaseGateway` lets a `JsonException` from
|
||
`JsonSerializer.Deserialize` propagate out of the delivery handler — the S&F engine
|
||
treats any thrown exception as a transient retry, so a corrupted or
|
||
schema-incompatible buffered row becomes a permanent poison message that is retried
|
||
on every sweep forever (the same retry-forever class of hazard `-015` already
|
||
addressed for a different cause). `ExternalSystemGateway-019` (Medium) is that
|
||
`HttpClient.Timeout` is never set, so any operator-configured `DefaultHttpTimeout`
|
||
greater than 100s is silently clipped by `HttpClient`'s built-in 100s default and the
|
||
gateway's "timeout applies to the HTTP request round-trip" guarantee no longer
|
||
holds — a partial reopen of the `-002` contract for the long-timeout case.
|
||
`ExternalSystemGateway-020` (Medium) is a silent precision-loss bug in the cached-DB-write
|
||
retry path: `JsonElementToParameterValue` collapses any JSON number that is not
|
||
Int64-convertible to `double`, so a script's `decimal` SQL parameter is downcast on
|
||
retry and only on retry. The remaining three (`-021`/`-022`/`-023`, Low) are an
|
||
unauthenticated-by-default `ApplyAuth` for unknown `AuthType` / malformed Basic config,
|
||
runtime-only HTTP-verb validation, and an undocumented PATCH HTTP method (code vs
|
||
design-doc drift). Theme: every new finding is in a code path that was added or
|
||
touched by the earlier fix bundle but whose error-propagation contract was not
|
||
verified end-to-end against the S&F engine or the design doc.
|
||
|
||
## Checklist coverage
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | ☑ | Prior: URL edge cases, dropped S&F result, classification — 003, 006, 009. Re-review: `MaxRetries == 0` retry-forever vs never-retry contradiction (015); trailing-`?` URL nit (017). |
|
||
| 2 | Akka.NET conventions | ☑ | No actors in this module; `AddExternalSystemGatewayActors` is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found. |
|
||
| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; the S&F delivery handlers resolve in a fresh DI scope on the sweep thread. No findings raised. |
|
||
| 4 | Error handling & resilience | ☑ | Prior: handler registration, double-dispatch, timeout, cancellation — 001, 002, 003, 008. Re-review: the unbounded-retry consequence of finding 015 is also a resilience defect (recorded under category 1). |
|
||
| 5 | Security | ☑ | Error bodies now truncated (007). No new issues — auth secrets not logged, body capped. |
|
||
| 6 | Performance & resource management | ☑ | Prior: disposal and repository-scan findings 005, 010, 011 — all resolved and verified. No new issues. |
|
||
| 7 | Design-document adherence | ☑ | Prior: timeout, retry settings, logging — 002, 004, 012. Re-review: finding 015 is also a design-adherence gap (S&F retry contract); recorded under category 1. |
|
||
| 8 | Code organization & conventions | ☑ | Prior: `MaxConcurrentConnectionsPerSystem` wiring — 013. Re-review: that wiring uses process-global `ConfigureHttpClientDefaults`, leaking the ESG cap onto every host `HttpClient` — finding 016. |
|
||
| 9 | Testing coverage | ☑ | Coverage is broad after finding 014. Re-review note: the `ZeroMaxRetries...` tests assert the persisted column, not the sweep outcome, and so lock in the finding-015 defect. |
|
||
| 10 | Documentation & comments | ☑ | Inline comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101` assert a "never retry" semantic that the code does not deliver — see finding 015. |
|
||
|
||
_Re-review (2026-05-28, `1eb6e97`):_
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | ☑ | `JsonException` not caught in either `DeliverBufferedAsync`, so a corrupt buffered payload becomes a permanent poison-message retried forever — finding 018. `JsonElementToParameterValue` collapses a non-Int64 number to `double`, silently losing precision for `decimal` SQL parameters on cached-write retry — finding 020. `new HttpMethod(method.HttpMethod)` accepts any string at runtime, so an invalid HTTP verb is only diagnosed at call time — finding 022. |
|
||
| 2 | Akka.NET conventions | ☑ | Still no actors in this module; `AddExternalSystemGatewayActors` remains a no-op. The cached-call lifecycle/audit emission lives in `ScriptRuntimeContext` / `CachedCallTelemetryForwarder` (SiteRuntime / AuditLog), not here, and that boundary is correct. No issues found. |
|
||
| 3 | Concurrency & thread safety | ☑ | Services are still stateless and DI-scoped; the S&F delivery handlers resolve in a fresh DI scope on the sweep thread. The added `executionId` / `sourceScript` / `parentExecutionId` plumbing flows through method arguments only — no shared state introduced. No findings. |
|
||
| 4 | Error handling & resilience | ☑ | The poison-payload retry-forever path is the headline resilience issue (finding 018). `HttpClient.Timeout` not being set leaves the gateway's per-call round-trip cap clipped to the framework's 100s default whenever the configured `DefaultHttpTimeout` is larger — finding 019 (partial reopen of the `-002` contract). |
|
||
| 5 | Security | ☑ | Auth secrets still never logged; error bodies still truncated. `ApplyAuth` is silent on unknown `AuthType` / empty `AuthConfiguration` / malformed Basic config — finding 021 (fail-open is a real but bounded risk; recorded Low because misconfiguration is the precondition). Connection-string handling in `DatabaseGateway` reads from the entity verbatim and never logs it. |
|
||
| 6 | Performance & resource management | ☑ | Disposal paths from findings 005/010 still hold. The `IHttpClientFactory` name-keyed-options registration (finding 016 fix) creates a fresh `SocketsHttpHandler` per primary-handler build — acceptable because `IHttpClientFactory` recycles handlers. No new findings. |
|
||
| 7 | Design-document adherence | ☑ | The design doc enumerates GET/POST/PUT/DELETE but the code also serializes a body for PATCH (and accepts arbitrary HTTP verbs at runtime) — finding 023 (drift to be reconciled). The per-call timeout guarantee is partially defeated by the unset `HttpClient.Timeout` for option values > 100s — finding 019. |
|
||
| 8 | Code organization & conventions | ☑ | The `-016` fix replaced `ConfigureHttpClientDefaults` with a scoped `IConfigureNamedOptions<HttpClientFactoryOptions>` — verified clean, no new conventions issue. `internal virtual CreateConnection` (DatabaseGateway) and `internal InvokeHttpAsync` (ExternalSystemClient) are exposed via `InternalsVisibleTo` for tests — acceptable. No new findings. |
|
||
| 9 | Testing coverage | ☑ | The `JsonException` deserialization path for `DeliverBufferedAsync` is untested; the `JsonElementToParameterValue` `double`-downcast path is untested; `ApplyAuth`'s unknown-AuthType / empty-config / malformed-Basic branches are untested. Recorded under findings 018 / 020 / 021 rather than a standalone coverage finding. |
|
||
| 10 | Documentation & comments | ☑ | XML doc additions in `1eb6e97` are accurate and consistent. PATCH support is undocumented in the design doc (finding 023). The inline `ExternalSystemGateway-015` block-comment in `CachedCallAsync` (lines 126–133) and the equivalent in `DatabaseGateway.cs:106–113` now correctly describe the "treat 0 as unset" semantics. |
|
||
|
||
## Findings
|
||
|
||
### ExternalSystemGateway-001 — No S&F delivery handler registered; cached calls and writes can never be delivered
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Critical |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:81` |
|
||
|
||
**Description**
|
||
|
||
`CachedCallAsync` and `CachedWriteAsync` enqueue messages under
|
||
`StoreAndForwardCategory.ExternalSystem` and `StoreAndForwardCategory.CachedDbWrite`.
|
||
`StoreAndForwardService.RegisterDeliveryHandler` is the only mechanism that lets the
|
||
S&F engine actually deliver a buffered message, and a repository-wide search shows it
|
||
is **never called for either category** anywhere in the codebase. Consequences:
|
||
|
||
1. On a transient failure, `EnqueueAsync` falls through to the "No handler registered
|
||
— buffer for later" branch (`StoreAndForwardService.cs:163`) and the message is
|
||
persisted.
|
||
2. During the retry sweep, `AttemptDeliveryAsync` (`StoreAndForwardService.cs:201`)
|
||
logs `"No delivery handler for category {Category}"` and returns without ever
|
||
removing or delivering the message.
|
||
|
||
The result is that every cached external call and cached DB write is silently
|
||
buffered forever and never delivered — a data-loss path for the exact "deferred
|
||
delivery is acceptable" use cases the design doc calls out (posting production data,
|
||
quality reports). The script also receives `WasBuffered: true` / a successful
|
||
`CachedWriteAsync` completion, so the failure is completely invisible.
|
||
|
||
**Recommendation**
|
||
|
||
Register delivery handlers for `StoreAndForwardCategory.ExternalSystem` and
|
||
`StoreAndForwardCategory.CachedDbWrite` during host/site startup. The `ExternalSystem`
|
||
handler should deserialize the payload, re-resolve the system/method, and re-invoke
|
||
`InvokeHttpAsync`, returning `true`/`false`/throwing per the transient-vs-permanent
|
||
contract `EnqueueAsync` expects. The `CachedDbWrite` handler should execute the SQL
|
||
against the named connection. Add an integration test that buffers a message and
|
||
verifies it is delivered by a retry sweep.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16. Delivery handlers for `StoreAndForwardCategory.ExternalSystem` and
|
||
`CachedDbWrite` are now registered at site startup in `AkkaHostedService`, after
|
||
`StoreAndForwardService.StartAsync()`. Each handler resolves its consumer in a fresh DI
|
||
scope and calls a new `DeliverBufferedAsync`: `ExternalSystemClient.DeliverBufferedAsync`
|
||
re-resolves the system/method and re-invokes `InvokeHttpAsync`, and
|
||
`DatabaseGateway.DeliverBufferedAsync` executes the buffered SQL — each returning `true`
|
||
on success, `false` (park) when the target no longer exists or fails permanently, and
|
||
throwing on transient failure so the engine retries. `EnqueueAsync` gained an
|
||
`attemptImmediateDelivery` parameter; `CachedCallAsync` passes `false` so registering the
|
||
handler does not dispatch the request twice (the double-dispatch noted in
|
||
`ExternalSystemGateway-003`). Regression tests cover the success, target-removed and
|
||
transient-retry paths. Fixed by the commit whose message references
|
||
`ExternalSystemGateway-001`.
|
||
|
||
### ExternalSystemGateway-002 — Per-system call timeout is never applied to HTTP requests
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:130`, `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13` |
|
||
|
||
**Description**
|
||
|
||
The design doc states each external system definition specifies a timeout that
|
||
"applies to all method calls on that system" and "applies to the HTTP request
|
||
round-trip", and `ExternalSystemGatewayOptions.DefaultHttpTimeout` exists as a
|
||
fallback. In practice no timeout is ever configured. `ServiceCollectionExtensions`
|
||
calls `services.AddHttpClient()` with no per-named-client configuration, and
|
||
`InvokeHttpAsync` calls `_httpClientFactory.CreateClient($"ExternalSystem_{system.Name}")`
|
||
without setting `client.Timeout` or passing a `CancellationToken` derived from a
|
||
timeout. `SendAsync` is therefore subject only to `HttpClient`'s default 100-second
|
||
timeout, regardless of the system definition or the configured `DefaultHttpTimeout`.
|
||
A slow or hung external system will block the calling Script Execution Actor far
|
||
longer than the operator configured, and the design's core error-handling guarantee
|
||
(timeout → transient classification) does not hold within the intended window.
|
||
|
||
There is also no `Timeout` field on `ExternalSystemDefinition` at all, so even a
|
||
correct implementation has nowhere to read the per-system value from — the entity is
|
||
missing the field the design requires.
|
||
|
||
**Recommendation**
|
||
|
||
Add a `Timeout` (TimeSpan) field to `ExternalSystemDefinition` and have
|
||
`InvokeHttpAsync` enforce it — either by setting `client.Timeout` via a typed/named
|
||
`HttpClient` registration, or by linking a `CancellationTokenSource` with the
|
||
per-system (or `DefaultHttpTimeout`) timeout to the supplied `cancellationToken`
|
||
before `SendAsync`. Ensure the resulting `TaskCanceledException`/`TimeoutException`
|
||
is classified as transient.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `<pending>`). `InvokeHttpAsync` now enforces a call
|
||
timeout: `ExternalSystemClient` takes an `IOptions<ExternalSystemGatewayOptions>` and
|
||
links a `CancellationTokenSource(DefaultHttpTimeout)` with the caller's token before
|
||
`SendAsync` and the response-body read, so the design's "timeout applies to the HTTP
|
||
request round-trip" guarantee now holds within the configured window (default 30s)
|
||
instead of `HttpClient`'s default 100s. A timeout is reclassified as a
|
||
`TransientExternalSystemException`; a caller-initiated cancellation is distinguished
|
||
from a timeout and propagated as `OperationCanceledException` rather than being
|
||
swallowed as transient. Regression tests:
|
||
`Call_SlowSystem_TimesOutAsTransientErrorWithinConfiguredWindow` and
|
||
`Call_CallerCancellation_IsNotMisreportedAsTimeout`.
|
||
|
||
Note (partial scope): the per-*system* `Timeout` field on `ExternalSystemDefinition`
|
||
remains unimplemented — adding it requires a change to `ScadaLink.Commons`, which is
|
||
outside this module's edit scope. Until that entity field exists, the configured
|
||
`DefaultHttpTimeout` is the effective per-call limit for every system. A follow-up
|
||
against the Commons module should add the `Timeout` field and have `InvokeHttpAsync`
|
||
prefer it over the default. This is a tracked follow-up, not a regression.
|
||
|
||
### ExternalSystemGateway-003 — `CachedCall` double-dispatches the HTTP request
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:84-117` |
|
||
|
||
**Description**
|
||
|
||
`CachedCallAsync` first calls `InvokeHttpAsync` directly (line 86). On a
|
||
`TransientExternalSystemException` it then calls `_storeAndForward.EnqueueAsync(...)`
|
||
(line 109). `StoreAndForwardService.EnqueueAsync` is **not** a pure enqueue — it
|
||
"Attempts immediate delivery" by invoking the registered delivery handler
|
||
(`StoreAndForwardService.cs:128-159`). If a delivery handler for the `ExternalSystem`
|
||
category is registered (as finding 001 recommends), the HTTP request will be executed
|
||
a **second time** synchronously inside `EnqueueAsync`, immediately after the first
|
||
attempt failed. For a transient failure that is actually a slow/overloaded system,
|
||
this doubles the load and — critically — if the original request did reach the
|
||
external system, the immediate retry produces a duplicate delivery before the script
|
||
even returns, worsening the idempotency hazard the design doc explicitly warns about.
|
||
|
||
**Recommendation**
|
||
|
||
Decide on one dispatch path. Either (a) have `CachedCall` not pre-invoke
|
||
`InvokeHttpAsync` and instead let `EnqueueAsync`'s immediate-delivery attempt be the
|
||
single first attempt (requires the handler to exist and to surface permanent vs
|
||
transient correctly); or (b) add an enqueue-only entry point to
|
||
`StoreAndForwardService` that skips the immediate-delivery attempt, and have
|
||
`CachedCall` use it after its own first attempt. Approach (a) is cleaner and removes
|
||
the duplicated logic.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `<pending>`). Re-triage: this finding was already fixed in
|
||
the codebase as a side effect of the `ExternalSystemGateway-001` fix and is no longer
|
||
reproducible against the current source. `StoreAndForwardService.EnqueueAsync` gained an
|
||
`attemptImmediateDelivery` parameter (recommendation approach (b)), and
|
||
`CachedCallAsync` passes `attemptImmediateDelivery: false` after its own first HTTP
|
||
attempt — so `EnqueueAsync` buffers the message for the background retry sweep without
|
||
re-invoking the registered delivery handler, eliminating the duplicate dispatch. A
|
||
dedicated regression test, `CachedCall_TransientFailure_DoesNotImmediatelyRedispatchViaRegisteredHandler`,
|
||
was added in this module's test suite: it registers a counting delivery handler, drives
|
||
a `CachedCall` whose HTTP attempt fails transiently, and asserts the handler is invoked
|
||
zero times during enqueue. The test was verified to fail if `attemptImmediateDelivery`
|
||
is flipped back to `true`.
|
||
|
||
### ExternalSystemGateway-004 — System retry settings are not honoured for cached calls/writes
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:114-115`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:86-87` |
|
||
|
||
**Description**
|
||
|
||
`CachedCallAsync` and `CachedWriteAsync` pass the definition's `MaxRetries` /
|
||
`RetryDelay` to `EnqueueAsync` only when they are non-default
|
||
(`MaxRetries > 0 ? ... : null`, `RetryDelay > TimeSpan.Zero ? ... : null`), otherwise
|
||
falling back to the S&F defaults. The site-side repository that supplies these
|
||
definitions, `SiteExternalSystemRepository.MapExternalSystem`
|
||
(`src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs:194`), never
|
||
reads `MaxRetries`/`RetryDelay` from SQLite at all — the constructed entities always
|
||
have `MaxRetries == 0` and `RetryDelay == TimeSpan.Zero`. As a result, at sites the
|
||
per-system retry settings the design doc requires are *always* discarded and the
|
||
global S&F defaults are silently used instead. The `> 0` guard in the ESG also makes
|
||
a legitimately-configured `MaxRetries` of 0 ("never retry") indistinguishable from
|
||
"unset", so an operator cannot express "do not retry".
|
||
|
||
**Recommendation**
|
||
|
||
Within this module, drop the `> 0` / `> Zero` guards and pass the definition values
|
||
through directly (or use nullable fields on the entity to distinguish "unset"). The
|
||
companion fix in `SiteExternalSystemRepository` to actually map the retry columns
|
||
should be tracked against the SiteRuntime module.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). `CachedCallAsync` and `CachedWriteAsync` now pass
|
||
the definition's `MaxRetries` to `EnqueueAsync` verbatim — the `> 0` guard is dropped, so
|
||
a legitimately-configured `MaxRetries` of 0 ("never retry") is honoured instead of being
|
||
collapsed to the S&F default. The `RetryDelay > TimeSpan.Zero` guard is deliberately
|
||
**kept**: `TimeSpan.Zero` is the entity default for an unconfigured field and a literal
|
||
zero-delay retry loop is not a valid configuration, so falling back to the S&F default
|
||
interval for an unset delay is correct (only `MaxRetries == 0` is a meaningful operator
|
||
choice). Regression test `CachedCall_TransientFailure_ZeroMaxRetriesIsHonouredNotTreatedAsUnset`
|
||
buffers a transient failure and asserts the buffered message carries `MaxRetries == 0`
|
||
rather than the S&F default; `CachedCall_TransientFailure_BuffersWithSystemRetrySettings`
|
||
additionally covers a non-default settings pass-through. The companion fix in
|
||
`SiteExternalSystemRepository.MapExternalSystem` to actually read the `MaxRetries` /
|
||
`RetryDelay` columns from SQLite remains a tracked follow-up against the SiteRuntime
|
||
module (outside this module's edit scope) — until then, sites still supply
|
||
`MaxRetries == 0`, which this fix now correctly honours as "never retry".
|
||
|
||
### ExternalSystemGateway-005 — `HttpRequestMessage` and `HttpResponseMessage` are not disposed
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Performance & resource management |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:133-167` |
|
||
|
||
**Description**
|
||
|
||
`InvokeHttpAsync` creates an `HttpRequestMessage` (line 133) and receives an
|
||
`HttpResponseMessage` from `SendAsync` (line 155); neither is wrapped in a `using` nor
|
||
explicitly disposed. Both are `IDisposable` and own resources (the request's
|
||
`StringContent`, the response's content stream). Under the per-invocation call volume
|
||
of a busy site this produces avoidable pressure on the finalizer queue and can hold
|
||
socket/stream resources longer than necessary. The success path reads the content but
|
||
never disposes the response; the error path likewise reads `errorBody` and then throws
|
||
without disposing.
|
||
|
||
**Recommendation**
|
||
|
||
Wrap the request in `using var request = ...` and the response in
|
||
`using var response = ...` (or call `Dispose()` in a `finally`). Ensure disposal still
|
||
occurs on the exception paths.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). `InvokeHttpAsync` now declares the request as
|
||
`using var request` and wraps all response handling in a `using (response)` block, so
|
||
both `IDisposable` instances (and the request's `StringContent` / the response content
|
||
stream) are released on the success path **and** on the permanent/transient
|
||
exception paths. Regression tests `Call_SuccessfulHttp_DisposesRequestAndResponse` and
|
||
`Call_PermanentFailure_StillDisposesRequestAndResponse` use a disposal-tracking
|
||
`HttpMessageHandler`/`HttpContent` and assert both the request and the response content
|
||
are disposed; both were verified to fail before the `using` wrappers were added.
|
||
|
||
### ExternalSystemGateway-006 — `BuildUrl` ignores path templates and appends a trailing slash for empty paths
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium — partially re-triaged: trailing-slash bug fixed; path-templating sub-issue is a design decision (see Resolution) |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:180-196` |
|
||
|
||
**Description**
|
||
|
||
`BuildUrl` does `baseUrl.TrimEnd('/') + "/" + path.TrimStart('/')`. When `method.Path`
|
||
is empty (a method that targets the base URL itself), this still appends a `/`,
|
||
producing `https://host/api/` which some servers treat as a different resource than
|
||
`https://host/api`. More importantly, the design doc shows method paths as templates
|
||
like `/recipes/{id}`, but `BuildUrl` performs no placeholder substitution — a `{id}`
|
||
token is sent literally in the URL and the corresponding parameter is instead appended
|
||
as a query-string entry (for GET/DELETE) or placed in the JSON body (POST/PUT). Either
|
||
the design's path-template feature is unimplemented, or the doc is stale; in the
|
||
current code a method defined as `/recipes/{id}` will never produce a correct URL.
|
||
|
||
**Recommendation**
|
||
|
||
Decide whether path templating is in scope. If yes, implement `{name}` substitution
|
||
from `parameters` in `BuildUrl` and exclude substituted parameters from the query
|
||
string/body. If no, update the component design doc to remove the `/recipes/{id}`
|
||
example and state that paths are literal. Also avoid appending a trailing `/` when
|
||
`path` is empty.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). The **trailing-slash bug** is fixed: `BuildUrl`
|
||
now appends a `/`-joined path segment only when the method's path is non-empty after
|
||
trimming, so a method targeting the base URL itself produces `https://host/api` rather
|
||
than `https://host/api/`. Regression tests `Call_MethodWithEmptyPath_DoesNotAppendTrailingSlash`
|
||
and `Call_MethodWithPath_BuildsExpectedUrl` (asserting on the captured request URI)
|
||
cover the empty-path and normal-path cases; the empty-path test was verified to fail
|
||
before the fix.
|
||
|
||
Re-triage of the **path-templating sub-issue** (`{id}` placeholder substitution): this
|
||
is a genuine design decision, not a code bug, and it requires editing the component
|
||
design doc — both outside this module's edit scope (`src/`, `tests/`, this file only).
|
||
The current code treats method paths as literal strings and routes parameters to the
|
||
query string (GET/DELETE) or JSON body (POST/PUT); a method authored as `/recipes/{id}`
|
||
sends the `{id}` token verbatim. **Tracked follow-up / surfaced design question:** the
|
||
design owner must decide whether path templating is in scope — if yes, implement
|
||
`{name}` substitution in `BuildUrl` and exclude substituted params from the
|
||
query/body; if no, the `Component-ExternalSystemGateway.md` `/recipes/{id}` example must
|
||
be changed to a literal path. The trailing-slash defect (the concrete correctness bug
|
||
in this finding) is fully resolved.
|
||
|
||
### ExternalSystemGateway-007 — External error response bodies are echoed verbatim into script-visible error messages
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Security |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:167-177` |
|
||
|
||
**Description**
|
||
|
||
On a non-success HTTP response, the full response body is read into `errorBody` and
|
||
embedded verbatim into the exception message (`$"HTTP {code} from {name}: {errorBody}"`),
|
||
which then flows into `ExternalCallResult.ErrorMessage` and back to the calling script,
|
||
and into Site Event Logging. An external system error page can be arbitrarily large
|
||
(an HTML stack trace, a multi-megabyte body) and may contain sensitive detail. There
|
||
is no size cap, so a hostile or misbehaving endpoint can inflate every error log entry
|
||
and error string returned to scripts. There is also no content-type check before
|
||
treating the body as text.
|
||
|
||
**Recommendation**
|
||
|
||
Truncate `errorBody` to a bounded length (e.g. 1–2 KB) before embedding it, and
|
||
consider logging the full body separately at debug level rather than returning it to
|
||
the script. Optionally only include the body when the content type is textual.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). `InvokeHttpAsync` now truncates the external error
|
||
response body to `MaxErrorBodyChars` (2048) via a `Truncate` helper before embedding it
|
||
into the transient/permanent exception message — so a misbehaving or hostile endpoint
|
||
can no longer inflate every script-visible `ErrorMessage` and Site Event Logging entry
|
||
with a multi-megabyte body. When truncation occurs the message is suffixed with
|
||
`… [truncated, N chars total]` so the original size is still visible. Regression test
|
||
`Call_PermanentFailureWithHugeErrorBody_TruncatesErrorMessage` drives a 400 with a
|
||
500 KB body and asserts the resulting `ErrorMessage` is bounded (< 4096 chars); it was
|
||
verified to fail (500 040-char message) before the cap was added. Content-type
|
||
filtering was considered optional in the recommendation and was not implemented — the
|
||
size cap alone closes the inflation/disclosure vector.
|
||
|
||
### ExternalSystemGateway-008 — Cancellation is conflated with transient timeout failure
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium — re-triaged: root cause already fixed in current source (see Resolution) |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ErrorClassifier.cs:24-30`, `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:157-159` |
|
||
|
||
**Description**
|
||
|
||
`ErrorClassifier.IsTransient(Exception)` returns `true` for `TaskCanceledException`
|
||
and `OperationCanceledException`. `HttpClient.SendAsync` throws `TaskCanceledException`
|
||
both when its internal timeout elapses *and* when the supplied `CancellationToken` is
|
||
cancelled (e.g. the Script Execution Actor is stopped, or the actor system is shutting
|
||
down). Because `InvokeHttpAsync`'s `catch` filter treats all of these as transient, a
|
||
caller-initiated cancellation during a `CachedCall` will be misclassified as a
|
||
transient failure and the message will be buffered for retry — work the caller
|
||
explicitly asked to abandon. For a `Call`, a shutdown-time cancellation is reported to
|
||
the script as a "Transient error" rather than an `OperationCanceledException`.
|
||
|
||
**Recommendation**
|
||
|
||
In `InvokeHttpAsync`, check `cancellationToken.IsCancellationRequested` first and
|
||
rethrow `OperationCanceledException` (or let it propagate) before applying transient
|
||
classification. Only treat a cancellation as a timeout when the supplied token is
|
||
*not* the one that was cancelled.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). **Re-triage:** the root cause described — a
|
||
caller-initiated cancellation being misclassified as a transient failure — is **no
|
||
longer present in the current source** and is not reproducible. `InvokeHttpAsync`
|
||
already wraps both `SendAsync` and the response-body `ReadAsStringAsync` in ordered
|
||
`catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)`
|
||
filters that rethrow the caller's cancellation *before* the
|
||
`catch (Exception ex) when (ErrorClassifier.IsTransient(ex))` branch is ever reached
|
||
(this was added alongside the `ExternalSystemGateway-002` timeout fix). A caller-cancel
|
||
therefore propagates as `OperationCanceledException` and is never buffered; only the
|
||
gateway's own timeout token reclassifies as transient.
|
||
|
||
`ErrorClassifier.IsTransient(Exception)` does still return `true` for
|
||
`TaskCanceledException`/`OperationCanceledException`, but that is **correct and
|
||
intentional**: a `TaskCanceledException` raised by an HTTP timeout *is* a genuine
|
||
transient failure, and the only caller (`InvokeHttpAsync:238`) is unreachable for a
|
||
caller-cancellation because the two preceding `when`-filtered catches intercept it
|
||
first. The transient-vs-cancel decision is contextual (which token fired) and cannot
|
||
be made from the exception type alone, which is exactly why the call site does it.
|
||
No source change was required. A regression guard,
|
||
`CachedCall_CallerCancellation_IsNotBufferedAsTransient`, was added: it cancels the
|
||
caller token mid-`CachedCall` and asserts an `OperationCanceledException` is thrown and
|
||
the S&F buffer remains empty (the cancelled work is not retried). The existing
|
||
`Call_CallerCancellation_IsNotMisreportedAsTimeout` covers the synchronous `Call` path.
|
||
|
||
### ExternalSystemGateway-009 — `StoreAndForwardResult` from `EnqueueAsync` is discarded; permanent failures during buffering are swallowed
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium — re-triaged: root cause subsumed by the ExternalSystemGateway-003 dispatch redesign (see Resolution) |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109-117` |
|
||
|
||
**Description**
|
||
|
||
`CachedCallAsync` assigns the result of `_storeAndForward.EnqueueAsync(...)` to
|
||
`sfResult` and then never reads it — it unconditionally returns
|
||
`new ExternalCallResult(true, null, null, WasBuffered: true)`. `EnqueueAsync` can
|
||
return `Success == false` (a permanent failure encountered during its
|
||
immediate-delivery attempt — `StoreAndForwardService.cs:142`) or `Buffered == false`
|
||
(delivered immediately). In both cases the ESG still reports the call as buffered and
|
||
successful to the script. A permanent failure surfaced by the S&F immediate attempt is
|
||
therefore silently lost instead of being returned to the script as the design requires
|
||
("On permanent failure (HTTP 4xx), the error is returned synchronously").
|
||
|
||
**Recommendation**
|
||
|
||
Inspect `sfResult`: if `Success == false` return an error `ExternalCallResult`; set
|
||
`WasBuffered` from `sfResult.Buffered` rather than hard-coding `true`. (This finding is
|
||
partly subsumed by the dispatch redesign in finding 003.)
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). **Re-triage:** the stated root cause — "a
|
||
permanent failure surfaced by `EnqueueAsync`'s immediate-delivery attempt is silently
|
||
lost" — **can no longer occur** in the current source, and the dead `sfResult` variable
|
||
the finding cites has already been removed. The `ExternalSystemGateway-003` fix changed
|
||
`CachedCallAsync` to call `EnqueueAsync` with `attemptImmediateDelivery: false`. With
|
||
that flag, `EnqueueAsync` never invokes the registered delivery handler: it skips the
|
||
immediate-delivery block entirely (so the `StoreAndForwardResult(false, …, …)`
|
||
permanent-failure return at `StoreAndForwardService.cs:147` is unreachable from this
|
||
caller) and unconditionally buffers, returning `Accepted: true, WasBuffered: true`
|
||
(`StoreAndForwardService.cs:180`). The `ExternalCallResult(true, null, null,
|
||
WasBuffered: true)` that `CachedCallAsync` returns is therefore now factually correct
|
||
in every reachable case — the message *is* buffered and there is no swallowed permanent
|
||
failure. Permanent HTTP 4xx failures are still surfaced synchronously, because
|
||
`CachedCallAsync` makes its own first HTTP attempt and catches
|
||
`PermanentExternalSystemException` *before* it ever reaches `EnqueueAsync`. No source
|
||
change was required beyond the `ExternalSystemGateway-003` redesign that already landed.
|
||
Coverage: `CachedCall_TransientFailure_BuffersWithSystemRetrySettings` asserts both
|
||
`result.WasBuffered == true` and that the message is genuinely present in the S&F buffer
|
||
(depth == 1), confirming the `WasBuffered: true` claim is not a lie; the existing
|
||
`CachedCall` permanent-failure path is exercised by `Call_Permanent400_ReturnsPermanentError`
|
||
semantics shared via `InvokeHttpAsync`.
|
||
|
||
### ExternalSystemGateway-010 — `GetConnectionAsync` leaks the `SqlConnection` when `OpenAsync` fails
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Performance & resource management |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:48-50` |
|
||
|
||
**Description**
|
||
|
||
`GetConnectionAsync` constructs `new SqlConnection(...)` and calls `await
|
||
connection.OpenAsync(...)`. If `OpenAsync` throws (unreachable server, bad
|
||
credentials, cancellation) the just-created `SqlConnection` instance is never disposed
|
||
— the exception propagates and the local reference is lost. While an unopened
|
||
`SqlConnection` is lightweight, over many failing calls this is an avoidable leak. The
|
||
design doc says `Database.Connection()` failures return an error to the script; the
|
||
current code lets a raw `SqlException` escape, which is acceptable, but the leak is
|
||
not.
|
||
|
||
**Recommendation**
|
||
|
||
Wrap the open in a try/catch that disposes the connection before rethrowing:
|
||
`try { await connection.OpenAsync(ct); } catch { connection.Dispose(); throw; }`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). `GetConnectionAsync` now wraps `OpenAsync` in a
|
||
`try/catch` that calls `await connection.DisposeAsync()` before rethrowing, so a failed
|
||
open (unreachable server, bad credentials, cancellation) no longer leaks the
|
||
`SqlConnection`. Connection creation was extracted into an `internal virtual
|
||
CreateConnection(string)` factory so the failure path is unit-testable. Regression test
|
||
`GetConnection_OpenFails_DisposesConnectionBeforeRethrowing` substitutes a `DbConnection`
|
||
whose `OpenAsync` always throws and asserts the connection is disposed when the
|
||
exception propagates; it was verified to fail before the `try/catch` was added.
|
||
|
||
### ExternalSystemGateway-011 — Every call performs a full repository scan of all systems and methods
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Performance & resource management |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:360-374`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:169-176` |
|
||
|
||
**Description**
|
||
|
||
`ResolveSystemAndMethodAsync` calls `GetAllExternalSystemsAsync()` and then
|
||
`GetMethodsByExternalSystemIdAsync()` and filters in memory on every single call;
|
||
`ResolveConnectionAsync` calls `GetAllDatabaseConnectionsAsync()` and filters in memory
|
||
on every cached write / connection request. At sites this hits the SQLite repository,
|
||
and `SiteExternalSystemRepository` re-reads and re-parses the methods JSON each time.
|
||
For a hot script path this is unnecessary repeated I/O and allocation. Definitions only
|
||
change on deployment, so they are eminently cacheable.
|
||
|
||
**Recommendation**
|
||
|
||
Add an in-memory cache of system/method/connection definitions keyed by name,
|
||
invalidated on artifact deployment. Alternatively use a name-keyed repository lookup
|
||
rather than fetch-all-then-filter.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `<pending>`). A cross-module change was explicitly
|
||
authorized, so the **name-keyed repository lookup** recommendation was applied — the
|
||
cleaner of the two options, and one that avoids the staleness hazard a
|
||
deployment-invalidated cache would introduce.
|
||
|
||
Three name-keyed methods were added to `IExternalSystemRepository`
|
||
(`ScadaLink.Commons`): `GetExternalSystemByNameAsync(name)`,
|
||
`GetMethodByNameAsync(externalSystemId, methodName)` and
|
||
`GetDatabaseConnectionByNameAsync(name)`. The connection lookup belongs on the same
|
||
interface because database connection definitions are already part of
|
||
`IExternalSystemRepository` (alongside `GetAllDatabaseConnectionsAsync` /
|
||
`GetDatabaseConnectionByIdAsync`), so the existing repository organization was
|
||
followed rather than introducing a new interface.
|
||
|
||
Both implementers of the interface were updated:
|
||
|
||
- `ScadaLink.ConfigurationDatabase.ExternalSystemRepository` — all three are genuine
|
||
server-side keyed queries (`FirstOrDefaultAsync(x => x.Name == name)`, the
|
||
method lookup additionally scoped by `ExternalSystemDefinitionId`), matching the
|
||
existing `GetMethodByNameAsync` / `GetListByNameAsync` / `GetSharedScriptByNameAsync`
|
||
convention in the other Central repositories.
|
||
- `ScadaLink.SiteRuntime.SiteExternalSystemRepository` — `GetExternalSystemByNameAsync`
|
||
and `GetDatabaseConnectionByNameAsync` are genuine single-row indexed SQLite queries
|
||
(`WHERE name = @name`; both tables have `name` as the PRIMARY KEY).
|
||
`GetMethodByNameAsync` resolves the named method from the parent system's
|
||
`method_definitions` JSON column; this still requires resolving the parent system
|
||
(one id→name scan), but the gateway's new call sequence performs **one** scan instead
|
||
of the previous **two** (`GetAllExternalSystemsAsync` + `GetMethodsByExternalSystemIdAsync`,
|
||
which itself scanned), so the site path is strictly better than before — noted as
|
||
the one place a residual scan remains, bounded by the deployed system count.
|
||
|
||
`ExternalSystemClient.ResolveSystemAndMethodAsync` and
|
||
`DatabaseGateway.ResolveConnectionAsync` were rewritten to call the keyed lookups; the
|
||
`GetAllExternalSystemsAsync` / `GetMethodsByExternalSystemIdAsync` /
|
||
`GetAllDatabaseConnectionsAsync` + in-memory `FirstOrDefault` filtering is gone from
|
||
both hot paths.
|
||
|
||
Regression tests (TDD — written first, verified failing/not-compiling before the
|
||
implementation, then confirmed green; one was additionally verified to fail when the
|
||
keyed query is deliberately broken):
|
||
|
||
- ConfigurationDatabase (`RepositoryCoverageTests`):
|
||
`GetExternalSystemByName_ReturnsMatchingRow`,
|
||
`GetExternalSystemByName_MissingName_ReturnsNull`,
|
||
`GetMethodByName_ReturnsMethodScopedToParentSystem`,
|
||
`GetMethodByName_MissingMethod_ReturnsNull`,
|
||
`GetDatabaseConnectionByName_ReturnsMatchingRow`,
|
||
`GetDatabaseConnectionByName_MissingName_ReturnsNull`.
|
||
- SiteRuntime (`SiteRepositoryTests`):
|
||
`ExternalSystemRepository_GetByName_ReturnsMatchingDefinition`,
|
||
`ExternalSystemRepository_GetByName_MissingName_ReturnsNull`,
|
||
`ExternalSystemRepository_GetMethodByName_ResolvesScopedToSystem`,
|
||
`DatabaseConnectionRepository_GetByName_ReturnsMatchingDefinition`.
|
||
- ExternalSystemGateway: the existing `ExternalSystemClientTests` /
|
||
`DatabaseGatewayTests` resolution stubs were migrated to the keyed methods (via
|
||
`StubResolution` / `StubConnection` helpers), so the full gateway suite now exercises
|
||
and protects the keyed-lookup resolution path.
|
||
|
||
`dotnet build ScadaLink.slnx` is clean; `ScadaLink.ExternalSystemGateway.Tests` (54),
|
||
`ScadaLink.ConfigurationDatabase.Tests` (106), `ScadaLink.SiteRuntime.Tests` (196) and
|
||
`ScadaLink.Commons.Tests` (226) all pass.
|
||
|
||
### ExternalSystemGateway-012 — Permanent-failure logging requirement is not met; `_logger` is injected but unused
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:24,169-177`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:22` |
|
||
|
||
**Description**
|
||
|
||
The design doc states permanent failures are "Logged to Site Event Logging", but
|
||
`InvokeHttpAsync` performs no logging on the permanent-failure path. In fact the
|
||
injected `ILogger<ExternalSystemClient>` and `ILogger<DatabaseGateway>` fields are
|
||
never used at all in either class. Either the logging is expected to happen in the
|
||
caller (Script Execution Actor) — in which case the design doc is imprecise about
|
||
where — or it is missing. Separately, `IsTransient(HttpStatusCode)` treats any
|
||
non-success, non-(5xx/408/429) status as permanent without an explicit comment, which
|
||
is a reasonable default but undocumented.
|
||
|
||
**Recommendation**
|
||
|
||
Add a `_logger.LogWarning` on the permanent-failure path (and a debug log on
|
||
transient), or clarify in the design doc that Site Event Logging capture is the
|
||
caller's responsibility and remove the unused `_logger` fields. Add a comment in
|
||
`ErrorClassifier` documenting the "default to permanent" behaviour.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Partial re-triage of the finding text: the
|
||
"`_logger` injected but unused" claim is **stale** — both loggers are already used (the
|
||
`DeliverBufferedAsync` retry-sweep handlers added by `ExternalSystemGateway-001` log
|
||
park/error events). The genuine remaining gap — `InvokeHttpAsync` performing no logging
|
||
on the HTTP-failure paths — is now fixed: `InvokeHttpAsync` emits a
|
||
`_logger.LogWarning` on the permanent-failure path (status code, system, method,
|
||
truncated error body) so a permanent failure is visible in Site Event Logging as the
|
||
design requires, and a `_logger.LogDebug` on the transient-failure path (transient
|
||
failures are normal retry/S&F operation and must not be noisy at warning level). A
|
||
documentation comment was added to `ErrorClassifier.IsTransient(HttpStatusCode)`
|
||
explaining the "any other non-success status defaults to permanent" behaviour and why
|
||
permanent is the safe default. Regression tests: `Call_PermanentFailure_LogsAWarning`
|
||
(asserts a warning carrying the system name is emitted; verified to fail before the
|
||
`LogWarning` was added) and `Call_TransientFailure_DoesNotLogAtWarningOrAbove` (guards
|
||
against over-logging transient failures).
|
||
|
||
### ExternalSystemGateway-013 — `MaxConcurrentConnectionsPerSystem` and `DefaultHttpTimeout` options are defined but never used
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Code organization & conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemGatewayOptions.cs:9,12`, `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13` |
|
||
|
||
**Description**
|
||
|
||
`ExternalSystemGatewayOptions.MaxConcurrentConnectionsPerSystem` (default 10) and
|
||
`DefaultHttpTimeout` (default 30s) are bound from configuration but neither is read
|
||
anywhere. `AddHttpClient()` registers the default factory with no
|
||
`ConfigurePrimaryHttpMessageHandler`/`SocketsHttpHandler` `MaxConnectionsPerServer` and
|
||
no `Timeout`, so both options have no effect. An operator setting these values gets
|
||
them silently ignored — a misleading configuration surface (`DefaultHttpTimeout` is
|
||
also referenced by finding 002).
|
||
|
||
**Recommendation**
|
||
|
||
Either wire the options into a named/typed `HttpClient` registration (set
|
||
`MaxConnectionsPerServer` on the primary handler, set `Timeout`), or remove the unused
|
||
options to avoid implying behaviour that does not exist.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Partial re-triage of the finding text:
|
||
`DefaultHttpTimeout` is **no longer unused** — it became the effective per-call HTTP
|
||
round-trip limit when `ExternalSystemGateway-002` was fixed (`InvokeHttpAsync` builds a
|
||
linked `CancellationTokenSource(DefaultHttpTimeout)`). The genuinely-unused option,
|
||
`MaxConcurrentConnectionsPerSystem`, is now wired in: `AddExternalSystemGateway` adds a
|
||
`ConfigureHttpClientDefaults` registration that supplies a `SocketsHttpHandler` whose
|
||
`MaxConnectionsPerServer` is bound from the option, so it applies to every per-system
|
||
named client (`ExternalSystem_{name}`) the gateway creates rather than being silently
|
||
ignored. Regression test
|
||
`ServiceWiringTests.MaxConcurrentConnectionsPerSystem_IsAppliedToTheNamedHttpClientPrimaryHandler`
|
||
builds the DI container, resolves the named client's handler chain via
|
||
`IHttpMessageHandlerFactory`, walks to the primary handler and asserts
|
||
`SocketsHttpHandler.MaxConnectionsPerServer` equals the configured value; it was verified
|
||
to fail before the wiring was added.
|
||
|
||
### ExternalSystemGateway-014 — Cached-call buffering path and `DatabaseGateway` are untested
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Testing coverage |
|
||
| Status | Resolved |
|
||
| Location | `tests/ScadaLink.ExternalSystemGateway.Tests/ExternalSystemClientTests.cs:1`, `tests/ScadaLink.ExternalSystemGateway.Tests/DatabaseGatewayTests.cs` |
|
||
|
||
**Description**
|
||
|
||
`ExternalSystemClientTests` covers system/method not-found, success, transient 500 and
|
||
permanent 400 for `CallAsync`, plus `CachedCall` not-found and success. It does **not**
|
||
cover: the `CachedCall` transient-failure → S&F buffering branch (the most
|
||
behaviour-rich path, including the `_storeAndForward == null` fallback and `WasBuffered`
|
||
semantics), the `CachedCall` permanent-failure branch, connection-exception
|
||
classification (`HttpRequestException` thrown by the handler), `BuildUrl` query-string
|
||
construction, and `ApplyAuth` for the apikey/basic variants. There is **no test file
|
||
for `DatabaseGateway`** at all — `GetConnectionAsync` not-found, `CachedWriteAsync`
|
||
not-found, and the `_storeAndForward == null` guard are entirely uncovered. The
|
||
`MockHttpMessageHandler` also does not assert request URL/headers/body, so auth and
|
||
URL construction are unverified.
|
||
|
||
**Recommendation**
|
||
|
||
Add tests for the `CachedCall` transient/buffering paths (with a substituted S&F
|
||
service), `DatabaseGateway` not-found and null-S&F guards, and `BuildUrl`/`ApplyAuth`
|
||
by asserting on the captured `HttpRequestMessage` in the mock handler.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Partial re-triage of the finding text: a number
|
||
of the listed gaps were **already closed** by tests added when findings 001–010 were
|
||
fixed — `DatabaseGatewayTests.cs` now exists (not-found, null-S&F guard, buffered-write
|
||
delivery), and the `CachedCall` transient/buffering and permanent paths are covered by
|
||
the `ExternalSystemGateway-001/003/004/008` regression tests. The remaining genuine
|
||
coverage gaps are now closed with new tests:
|
||
|
||
- `Call_GetWithParameters_AppendsEscapedQueryString` — `BuildUrl` GET query-string
|
||
construction, asserting on the captured request URI and that an `&` inside a value is
|
||
percent-escaped.
|
||
- `Call_PostWithParameters_SendsJsonBody` — POST JSON-body construction.
|
||
- `Call_ApiKeyAuthWithDefaultHeader_SendsXApiKeyHeader`,
|
||
`Call_ApiKeyAuthWithCustomHeader_SendsNamedHeader`,
|
||
`Call_BasicAuth_SendsBase64AuthorizationHeader` — `ApplyAuth` for all three auth
|
||
variants, asserting on the captured request headers.
|
||
- `Call_ConnectionError_IsClassifiedAsTransient` — a connection-level
|
||
`HttpRequestException` is classified transient.
|
||
- `CachedWrite_BuffersTheWriteWithConnectionRetrySettings` and
|
||
`CachedWrite_ZeroMaxRetriesIsHonouredNotTreatedAsUnset` — the `DatabaseGateway`
|
||
`CachedWrite` happy-path enqueue against a real S&F service.
|
||
|
||
The shared `RequestCapturingHandler` test helper was extended to capture request
|
||
headers and body so URL/auth/body construction is now verified, not just status codes.
|
||
These are new-coverage tests against already-correct behaviour, so they pass on the
|
||
current source; the `BuildUrl` and `ApplyAuth` paths they exercise are now protected
|
||
against regression.
|
||
|
||
### ExternalSystemGateway-015 — `MaxRetries == 0` is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:120-127`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:102-108` |
|
||
|
||
**Description**
|
||
|
||
The `ExternalSystemGateway-004` fix removed the `MaxRetries > 0 ? ... : null` guard so
|
||
that `CachedCallAsync` and `CachedWriteAsync` now pass the definition's `MaxRetries`
|
||
to `StoreAndForwardService.EnqueueAsync` verbatim. The stated rationale (and the inline
|
||
comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101`, plus the
|
||
tests `CachedCall_TransientFailure_ZeroMaxRetriesIsHonouredNotTreatedAsUnset` and
|
||
`CachedWrite_ZeroMaxRetriesIsHonouredNotTreatedAsUnset`) is that a configured
|
||
`MaxRetries` of `0` means **"never retry"**.
|
||
|
||
That is the opposite of what the Store-and-Forward engine actually does with the
|
||
value. `EnqueueAsync` stores the passed `maxRetries` directly into
|
||
`StoreAndForwardMessage.MaxRetries` (`StoreAndForwardService.cs:139`), whose own XML
|
||
doc states **"Maximum retry-sweep attempts before parking (0 = no limit)"**
|
||
(`StoreAndForwardMessage.cs:30`). The retry sweep enforces it as
|
||
`if (message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries)`
|
||
(`StoreAndForwardService.cs:285`) — when `MaxRetries == 0` that guard is false on
|
||
every sweep, so the message is **never parked and is retried forever**.
|
||
|
||
Consequences for a `CachedCall`/`CachedWrite` against a system or connection
|
||
configured with `MaxRetries = 0`:
|
||
|
||
1. A transiently-failing message that the operator intended never to retry is instead
|
||
retried on every sweep indefinitely, accumulating in the buffer and repeatedly
|
||
re-dispatching the request — the exact unbounded-retry / duplicate-delivery hazard
|
||
the idempotency note in the design doc warns about.
|
||
2. The two ESG regression tests cited above assert `max_retries == 0` is *stored* and
|
||
describe that as "honoured" — they verify the persisted column, never the resulting
|
||
sweep behaviour, so they lock in the broken semantics.
|
||
3. Because the SiteRuntime repository still always supplies `MaxRetries == 0` (the
|
||
open companion gap noted in `ExternalSystemGateway-004`), **every** cached call and
|
||
cached write at every site currently buffers as retry-forever. Before the ESG-004
|
||
fix the `> 0` guard sent `null`, so the S&F `DefaultMaxRetries` (a bounded value)
|
||
applied — i.e. the ESG-004 fix turned a bounded retry into an unbounded one for the
|
||
common path.
|
||
|
||
**Recommendation**
|
||
|
||
Reconcile the ESG and S&F interpretations of `MaxRetries == 0` — they must agree.
|
||
Either: (a) treat the entity's `MaxRetries == 0` as "unset" and pass `null` so the
|
||
bounded S&F default applies (reverting to the pre-ESG-004 behaviour, and accepting
|
||
that "never retry" then needs a different representation such as a nullable field or a
|
||
`MaxRetries == 1` convention); or (b) if `0` genuinely must mean "never retry", add an
|
||
explicit no-retry path — e.g. do not enqueue at all on transient failure when
|
||
`MaxRetries == 0`, or introduce a distinct sentinel — and fix the
|
||
`StoreAndForwardMessage.MaxRetries` doc and `RetryMessageAsync` guard so `0` no longer
|
||
means "no limit". Update the two `ZeroMaxRetries...` tests to assert the *sweep*
|
||
outcome (parked / not retried), not just the stored column value.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17. Root cause confirmed: the S&F engine treats a stored
|
||
`MaxRetries == 0` as "no limit / retry forever" (`StoreAndForwardMessage.MaxRetries`
|
||
doc "0 = no limit"; sweep guard `MaxRetries > 0 && RetryCount >= MaxRetries`), while
|
||
the entity's non-nullable `int MaxRetries` defaults to `0` — so passing it verbatim
|
||
buffered every cached call/write as an unbounded retry loop. Fix (ESG-side only,
|
||
recommendation (a)): `CachedCallAsync` and `CachedWriteAsync` now pass
|
||
`MaxRetries > 0 ? MaxRetries : null`, so an entity `0` is treated as "unset" and the
|
||
bounded S&F `DefaultMaxRetries` applies; the misleading "0 = never retry" inline
|
||
comments were corrected. The two `ZeroMaxRetries...` tests were rewritten to
|
||
`CachedCall_TransientFailure_ZeroMaxRetriesIsTreatedAsUnsetNotRetryForever` /
|
||
`CachedWrite_ZeroMaxRetriesIsTreatedAsUnsetNotRetryForever`, asserting the buffered
|
||
message carries the bounded default (99) and never `0`.
|
||
|
||
### ExternalSystemGateway-016 — `ConfigureHttpClientDefaults` applies the ESG connection cap to every `HttpClient` in the host process
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Code organization & conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:21-29` |
|
||
|
||
**Description**
|
||
|
||
The `ExternalSystemGateway-013` fix wires `MaxConcurrentConnectionsPerSystem` by
|
||
calling `services.ConfigureHttpClientDefaults(builder => builder.ConfigurePrimaryHttpMessageHandler(...))`.
|
||
The inline comment claims this "applies to the dynamically-named clients created by
|
||
`ExternalSystemClient`" — but `ConfigureHttpClientDefaults` is **process-global**: it
|
||
adds the configuration to *every* `HttpClient`/`IHttpClientFactory` client created
|
||
anywhere in the host, regardless of name.
|
||
|
||
The Host registers the External System Gateway alongside other components that also
|
||
use `IHttpClientFactory` — notably `ScadaLink.NotificationService` (`OAuth2TokenService`
|
||
and its `ServiceCollectionExtensions` call `AddHttpClient`). With the ESG registration
|
||
present, the OAuth2 token client (and any future `HttpClient` consumer in the host)
|
||
has its **primary handler replaced** by a `SocketsHttpHandler` whose
|
||
`MaxConnectionsPerServer` is the ESG's `MaxConcurrentConnectionsPerSystem`. That:
|
||
|
||
1. Silently caps unrelated clients at a value owned by, and named for, a different
|
||
component — an operator tuning the ESG option unknowingly throttles Microsoft 365
|
||
OAuth2 token acquisition.
|
||
2. Overrides/discards any primary-handler configuration those other components add for
|
||
their own clients (e.g. a custom handler, proxy, or certificate settings).
|
||
|
||
This is a leaky, surprising side effect for what the option claims to be a per-ESG
|
||
setting; `ConfigureHttpClientDefaults` should not be used to express a single
|
||
component's policy.
|
||
|
||
**Recommendation**
|
||
|
||
Scope the handler configuration to the gateway's own clients only. The ESG already
|
||
creates per-system clients with a deterministic name pattern
|
||
(`ExternalSystem_{system.Name}`); register a typed/named `HttpClient` (or a small
|
||
factory abstraction) for that pattern and call `ConfigurePrimaryHttpMessageHandler`
|
||
on that registration instead of on the global defaults. If a name-pattern registration
|
||
is impractical, document the global effect explicitly and rename the option, but the
|
||
preferred fix is to stop using `ConfigureHttpClientDefaults`.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17. Root cause confirmed: `ConfigureHttpClientDefaults` is
|
||
process-global and replaced the primary handler of every `IHttpClientFactory` client
|
||
in the host, leaking the ESG connection cap onto unrelated clients. Fix: the global
|
||
`ConfigureHttpClientDefaults` registration was replaced with an
|
||
`IConfigureNamedOptions<HttpClientFactoryOptions>` (`GatewayHttpClientConfigurator`)
|
||
that applies the `SocketsHttpHandler`/`MaxConnectionsPerServer` cap only to clients
|
||
whose name starts with `ExternalSystem_` (the gateway's own per-system clients), so
|
||
clients owned by other components keep their own (or the framework default) primary
|
||
handler. Regression test
|
||
`ServiceWiringTests.MaxConcurrentConnectionsPerSystem_IsNotAppliedToNonGatewayHttpClients`
|
||
asserts a non-gateway client does not inherit the cap while the gateway client still
|
||
does; it was verified to fail before the fix.
|
||
|
||
### ExternalSystemGateway-017 — `BuildUrl` appends a bare trailing `?` when a GET method's parameters are all null
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:324-333` |
|
||
|
||
**Description**
|
||
|
||
In `BuildUrl`, the GET/DELETE query-string branch is entered when
|
||
`parameters != null && parameters.Count > 0`, but the projection then filters out
|
||
null-valued entries (`parameters.Where(p => p.Value != null)`). When a GET/DELETE
|
||
method is invoked with a non-empty parameter dictionary whose values are *all* null,
|
||
`queryString` is the empty string and the code still executes `url += "?" + queryString`,
|
||
producing a URL ending in a bare `?` (e.g. `https://host/api/resource?`). Most servers
|
||
tolerate a trailing `?`, but it is an unintended artifact, can defeat response caching
|
||
on some endpoints, and makes captured request URLs harder to read in logs.
|
||
|
||
**Recommendation**
|
||
|
||
Only append `"?" + queryString` when `queryString` is non-empty (compute the joined
|
||
string first and check it), so a method whose effective parameter set is empty
|
||
produces a clean URL identical to the no-parameters case.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17. Root cause confirmed: `BuildUrl` appended `"?" + queryString`
|
||
whenever the GET/DELETE parameter dictionary was non-empty, even when every value
|
||
was null and `queryString` was the empty string, yielding a bare trailing `?`. Fix:
|
||
`BuildUrl` now appends `"?" + queryString` only when `queryString.Length > 0`, so a
|
||
method whose effective parameter set is empty produces a URL identical to the
|
||
no-parameters case. Regression test
|
||
`Call_GetWithAllNullParameters_DoesNotAppendTrailingQuestionMark` asserts the
|
||
captured request URI has no trailing `?`; it was verified to fail before the fix.
|
||
|
||
### ExternalSystemGateway-018 — `DeliverBufferedAsync` lets `JsonException` propagate, turning a corrupt buffered row into a permanent retry-forever poison message
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:176`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:151` |
|
||
|
||
**Resolution** — Wrapped the `JsonSerializer.Deserialize<...>(message.PayloadJson)` call in both `ExternalSystemClient.DeliverBufferedAsync` and `DatabaseGateway.DeliverBufferedAsync` in a `try`/`catch (JsonException)` block. A `JsonException` is by definition permanent (the same payload bytes always deserialize identically), so the catch branch logs at `LogError` and returns `false`, parking the message via the S&F engine instead of letting it throw and be retried as a transient failure. Regression tests `DeliverBuffered_MalformedJsonPayload_ReturnsFalseSoMessageParks` were added to both `ExternalSystemClientTests` and `DatabaseGatewayTests` — each feeds a truncated `PayloadJson` to the handler and asserts `delivered == false` and that no exception escapes.
|
||
|
||
**Description**
|
||
|
||
Both `ExternalSystemClient.DeliverBufferedAsync` and `DatabaseGateway.DeliverBufferedAsync`
|
||
begin with an unguarded `JsonSerializer.Deserialize<...>(message.PayloadJson)`:
|
||
|
||
```csharp
|
||
var payload = JsonSerializer.Deserialize<CachedCallPayload>(message.PayloadJson);
|
||
if (payload == null || string.IsNullOrEmpty(payload.SystemName) || ...) {
|
||
_logger.LogError("... unreadable payload; parking.");
|
||
return false;
|
||
}
|
||
```
|
||
|
||
The "unreadable payload; parking" branch is only entered when `Deserialize` *succeeds*
|
||
and produces a null / partially-empty object. If `PayloadJson` is **malformed JSON** —
|
||
the column was truncated mid-write, an older payload schema is being deserialized into a
|
||
newer record, or storage corruption occurred — `Deserialize` throws `JsonException`
|
||
before that check is ever reached. The exception propagates out of the delivery handler.
|
||
|
||
The Store-and-Forward retry loop treats *any* thrown exception from a delivery handler
|
||
as a transient failure (only a returned `false` parks the message); see
|
||
`StoreAndForwardService.RetryMessageAsync`. Combined with the `MaxRetries == 0` →
|
||
"unset → bounded default" fix from `-015`, the resulting behaviour is:
|
||
|
||
1. Corrupt payload arrives in the buffer.
|
||
2. Every retry sweep deserializes, throws `JsonException`, increments `RetryCount`.
|
||
3. The message is retried until `RetryCount >= MaxRetries`, then parked — *only* if
|
||
`MaxRetries > 0` is configured (which `-015` already established is not the default
|
||
site configuration today). With the bounded S&F default it does eventually park, but
|
||
it park-loops noisily for `DefaultMaxRetries` iterations first; without that bound it
|
||
retries forever.
|
||
4. The script is unaware — the cached call was returned `WasBuffered: true` long ago.
|
||
|
||
This is the same "poison message buffered forever" class of hazard that
|
||
`ExternalSystemGateway-001` (no-handler) and `ExternalSystemGateway-015` (MaxRetries==0)
|
||
already removed for their own causes; corrupt JSON is an alternative arrival path into
|
||
the same bad state.
|
||
|
||
The `DatabaseGateway.DeliverBufferedAsync` path has the same shape and the same defect:
|
||
`JsonSerializer.Deserialize<CachedWritePayload>` at line 151 is not guarded.
|
||
|
||
**Recommendation**
|
||
|
||
Wrap the `Deserialize` call in a `try/catch (JsonException)` block in both
|
||
`DeliverBufferedAsync` methods. A `JsonException` is by definition a permanent failure —
|
||
re-running the same deserialization against the same payload will produce the same
|
||
exception — so the catch should log at `LogError` and **return `false`** so the S&F
|
||
engine parks the message rather than retrying. Add regression tests that feed a
|
||
malformed `PayloadJson` to each handler and assert `delivered == false` (i.e. the
|
||
message parks) and that no exception escapes the handler.
|
||
|
||
### ExternalSystemGateway-019 — `HttpClient.Timeout` is not set; `DefaultHttpTimeout` > 100s is silently clipped by the framework default
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:226,257-264`, `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:90-102` |
|
||
|
||
**Resolution (2026-05-28):** Set `client.Timeout = Timeout.InfiniteTimeSpan` immediately after `_httpClientFactory.CreateClient($"ExternalSystem_{system.Name}")` in `ExternalSystemClient.InvokeHttpAsync`, disabling the framework's 100 s default so the per-call `CancellationTokenSource(_options.DefaultHttpTimeout)` linked CTS already built below is the sole timeout source. An operator-configured `DefaultHttpTimeout` greater than 100 s is now honoured verbatim instead of being silently clipped and misclassified as a transient "connection error". Kept the fix local to the allowed file (`ExternalSystemClient.cs`) rather than touching `ServiceCollectionExtensions.cs`/`GatewayHttpClientConfigurator`. Regression test `Call_DisablesHttpClientFrameworkTimeoutSoLongTimeoutsArentClipped` asserts the rented client starts with the framework's 100 s default and is set to `Timeout.InfiniteTimeSpan` after `InvokeHttpAsync` runs.
|
||
|
||
**Description**
|
||
|
||
The `-002` fix enforces the per-call timeout via a linked `CancellationTokenSource`
|
||
built from `_options.DefaultHttpTimeout` and passed into `SendAsync`. That correctly
|
||
caps every call to *at most* the configured value when `DefaultHttpTimeout` ≤ 100s.
|
||
However, `HttpClient.Timeout` (the framework default) is never set on either the named
|
||
client or its primary handler — the `GatewayHttpClientConfigurator` only sets
|
||
`MaxConnectionsPerServer`. `HttpClient.Timeout` defaults to **100 seconds**, and
|
||
`SendAsync` enforces it internally by cancelling its own private CTS, raising a
|
||
`TaskCanceledException` from `SendAsync` *without* cancelling either the caller's token
|
||
or the gateway's `timeoutCts`.
|
||
|
||
Consequences when an operator configures `DefaultHttpTimeout` to anything > 100s
|
||
(a legitimate setting for external systems with long-running endpoints — recipe
|
||
exports, large queries):
|
||
|
||
1. The gateway's `timeoutCts` (e.g. 5 minutes) has not yet fired.
|
||
2. `HttpClient.Timeout` fires at 100s, `SendAsync` throws.
|
||
3. Neither `when (cancellationToken.IsCancellationRequested)` nor
|
||
`when (timeoutCts.IsCancellationRequested)` matches, so the exception falls into
|
||
the generic `catch (Exception ex) when (ErrorClassifier.IsTransient(ex))` branch
|
||
(line 277) and is re-thrown as a `TransientExternalSystemException` with the
|
||
message `"Connection error to {Name}: A task was canceled."` — misattributing a
|
||
timeout as a connection error.
|
||
4. The configured 5-minute round-trip window the design doc promises ("Each external
|
||
system definition specifies a timeout that applies to all method calls on that
|
||
system" / "applies to the HTTP request round-trip") is silently overridden.
|
||
|
||
The opposite case (`DefaultHttpTimeout` < 100s) is the only one the `-002` regression
|
||
test exercises (200ms), so the defect is not caught by the existing suite.
|
||
|
||
**Recommendation**
|
||
|
||
Set `HttpClient.Timeout = Timeout.InfiniteTimeSpan` on the gateway's named clients via
|
||
the existing `GatewayHttpClientConfigurator` (delegate `HttpClientActions` rather than
|
||
just `HttpMessageHandlerBuilderActions`), so the cancellation-token mechanism is the
|
||
sole timeout source. The linked `timeoutCts` then reliably enforces
|
||
`DefaultHttpTimeout` for every value, and the timeout-vs-cancellation classification at
|
||
lines 266–276 stays accurate. Add a regression test that configures `DefaultHttpTimeout`
|
||
to ~150s, hangs the handler, and asserts the call times out at the configured value
|
||
and produces a `"Timeout calling..."` (not `"Connection error to..."`) error.
|
||
|
||
### ExternalSystemGateway-020 — `JsonElementToParameterValue` silently downcasts non-Int64 JSON numbers to `double`, losing precision for `decimal` SQL parameters on retry
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:185-193` |
|
||
|
||
**Resolution (2026-05-28):** `JsonElementToParameterValue` now probes `TryGetInt64` → `TryGetDecimal` → `GetDouble`, so a JSON number that fits in `decimal` materialises as a `decimal` (preserving the script's authored precision on cached-write retries) and only genuinely out-of-decimal-range values fall through to `double`. Regression test `JsonElementToParameterValue_DecimalShapedNumber_PreservesPrecisionViaDecimal` round-trips `1234567890.1234567890` through a `JsonElement` and asserts the result is a `decimal` carrying the original precision; companion tests guard the long-fast-path and the out-of-range-double fallback.
|
||
|
||
**Description**
|
||
|
||
`DatabaseGateway.JsonElementToParameterValue` materialises the buffered cached-write
|
||
SQL parameter values during a retry-sweep delivery:
|
||
|
||
```csharp
|
||
private static object JsonElementToParameterValue(JsonElement element) => element.ValueKind switch
|
||
{
|
||
JsonValueKind.String => (object?)element.GetString() ?? DBNull.Value,
|
||
JsonValueKind.Number => element.TryGetInt64(out var l) ? l : element.GetDouble(),
|
||
...
|
||
};
|
||
```
|
||
|
||
For a JSON number, the helper attempts `Int64` first and otherwise returns a `double`.
|
||
There is no `decimal` branch. The immediate-attempt path is unaffected — `CachedWriteAsync`
|
||
on the original call serializes the script-provided typed parameters via
|
||
`JsonSerializer.Serialize(new { ConnectionName, Sql, Parameters = parameters })` and
|
||
executes the SQL directly outside this code path. But the **retry path** runs through
|
||
`DeliverBufferedAsync` → `JsonElementToParameterValue`, so a script that submitted
|
||
a `decimal` value (e.g. `123.4567890123m`) gets:
|
||
|
||
1. Immediate attempt: `decimal` parameter, full precision (or, more accurately, the
|
||
value never enters this helper because cached writes today never re-execute on the
|
||
immediate path — but on the retry path it does).
|
||
2. Retry attempt(s) after a transient failure: the value is deserialized as a JSON
|
||
number, fails `TryGetInt64`, and is downcast to `double` — which has ~15–17 digits
|
||
of precision against `decimal`'s 28–29. A SQL column of type `decimal(18, 6)` or
|
||
`numeric` receives a value that has been truncated to `double` precision before
|
||
parameter binding.
|
||
|
||
Two further consequences worth recording:
|
||
|
||
- The downcast is **silent** — there is no log, no error, and the cached-write
|
||
acknowledgement to the script has long since happened. Data drift between a
|
||
same-call immediate-success delivery and a same-call retry delivery is the worst
|
||
shape of "looks like the right value but isn't" defect.
|
||
- For SCADA telemetry (process variables, totals, currency-denominated quality
|
||
reports) `decimal` is the correct CLR type and `double`'s representation error
|
||
changes the persisted value.
|
||
|
||
**Recommendation**
|
||
|
||
Replace the `Number` branch with a precision-preserving cascade — try `Int64`, then
|
||
`decimal` (`element.TryGetDecimal(out var d) ? d : element.GetDouble()`), and only
|
||
fall back to `double` when even `decimal` fails. Add a regression test against
|
||
`DatabaseGateway.DeliverBufferedAsync` that buffers a write with a high-precision
|
||
`decimal` parameter, drives the delivery, and asserts the SQL parameter bound is a
|
||
`decimal` (or compares the round-tripped value to the original at the parameter level)
|
||
rather than a `double` with truncated precision. The same Number-branch decision should
|
||
be reviewed against `JsonValueKind.True`/`False`/`Null` (currently fine) and a string
|
||
that happens to encode a number (already correctly returns `string`).
|
||
|
||
### ExternalSystemGateway-021 — `ApplyAuth` silently sends an unauthenticated request on unknown `AuthType`, empty `AuthConfiguration`, or malformed Basic config
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Security |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:385-415` |
|
||
|
||
**Resolution (2026-05-28):** `ApplyAuth` is now an instance method that uses
|
||
the existing `_logger`. Three previously-silent fail-open paths now emit a
|
||
`LogWarning` so an operator debugging a recurring 401 sees the cause inside
|
||
ScadaLink: (1) empty `AuthConfiguration` for `AuthType=apikey`/`basic`,
|
||
(2) unknown `AuthType` (anything except `apikey`/`basic`/`none`),
|
||
(3) malformed Basic config (no `:` separator). The `AuthConfiguration`
|
||
value is NEVER included in the log message. `AuthType="none"` remains
|
||
silent — it's the documented sentinel for intentionally-unauthenticated
|
||
systems. Behaviour is otherwise unchanged: the request still goes out
|
||
(never block on auth-config issues), the failure mode is just now visible.
|
||
|
||
**Description**
|
||
|
||
`ApplyAuth` has three fail-open paths that all result in an HTTP request being sent
|
||
**without** the credential the operator configured:
|
||
|
||
1. Line 387 — `if (string.IsNullOrEmpty(system.AuthConfiguration)) return;` returns
|
||
early regardless of `AuthType`. A system entity with `AuthType = "apikey"` but an
|
||
empty `AuthConfiguration` (e.g. the secret column failed to deploy, or the
|
||
protector key changed and decryption produced `""`) sends every request with no
|
||
`X-API-Key` header — the gateway is silent.
|
||
2. The `switch` has no `default` arm. A system entity with `AuthType = "bearer"`,
|
||
`"oauth2"`, a typo like `"ApiKey "` (trailing space) or even `"none"` falls off the
|
||
`switch` and the request is sent without any auth header — again silent.
|
||
3. Line 408 — `if (basicParts.Length == 2)` skips the auth attach when
|
||
`AuthConfiguration` for `basic` lacks a `:` separator. The request is sent with no
|
||
`Authorization` header.
|
||
|
||
Effectively the gateway treats every misconfiguration as "send anonymously" and
|
||
relies on the remote system rejecting it with a 401/403. That is a defensible default
|
||
on its own, but combined with `-007`'s 2 KB error-body cap and the fact that no audit
|
||
or warning is emitted, an operator debugging "why does my external system always
|
||
return 401" has nothing to go on inside ScadaLink — the gateway never says it failed
|
||
to apply auth. For `AuthType = "none"` (the design's expected sentinel for
|
||
unauthenticated systems) the fall-through is correct; the failure mode is misconfig.
|
||
|
||
**Recommendation**
|
||
|
||
Add a `default:` arm to the `switch` that logs `_logger.LogWarning(...)` naming the
|
||
unknown `AuthType` and the system, and emit a similar warning when
|
||
`AuthConfiguration` is empty for an `AuthType` of `"apikey"` or `"basic"` (those
|
||
require a value; `"none"` does not). For Basic auth specifically, the
|
||
`basicParts.Length != 2` branch should also warn. Do **not** include the
|
||
`AuthConfiguration` value in the log message — secrets must stay out of the log
|
||
(consistent with the existing module). A small set of `ApplyAuth` unit tests
|
||
verifying the warning emission and that no `Authorization` / `X-API-Key` header is
|
||
ever leaked in the warning text would close the test gap as well.
|
||
|
||
### ExternalSystemGateway-022 — `new HttpMethod(method.HttpMethod)` accepts any string at runtime; an invalid HTTP verb fails only at call time
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:233` |
|
||
|
||
**Resolution (2026-05-28):** Added a `ValidateHttpMethod` helper called at the top of `InvokeHttpAsync` that rejects any verb outside the documented `GET/POST/PUT/PATCH/DELETE` allowlist (matching ESG-023's design-doc reconciliation) with a clear `ArgumentException` naming the offending verb. Allowlist is a `HashSet<string>` with `OrdinalIgnoreCase` so the operator-authored entity column is case-insensitive. Regression tests `Call_UnsupportedHttpMethod_ThrowsArgumentException` (Theory: FOO/DLETE/GIT/OPTIONS/HEAD) and `Call_DocumentedHttpMethod_IsAccepted` (Theory: GET/get/Post/PATCH/delete) cover the rejection and the case-insensitive accept paths.
|
||
|
||
**Description**
|
||
|
||
`InvokeHttpAsync` constructs the request method directly from the string column:
|
||
`new HttpRequestMessage(new HttpMethod(method.HttpMethod), url)`. `System.Net.Http.HttpMethod(string)`
|
||
performs only a token-character validation (it rejects whitespace and control chars
|
||
but accepts arbitrary non-standard tokens like `"FOO"` or `"GIT"`). The body-vs-query
|
||
selection at lines 239–250 explicitly checks for POST/PUT/PATCH; for any other
|
||
non-standard verb (`"FOO"`) the parameters silently go to neither body nor query
|
||
string and the request is dispatched anyway.
|
||
|
||
The design doc enumerates GET/POST/PUT/DELETE as the supported set. There is no
|
||
validation at deployment time, at definition save time, or at gateway
|
||
resolution time that `method.HttpMethod` is one of the expected verbs. An operator
|
||
who typos `"DLETE"` discovers the issue only when a script invokes that method and
|
||
the remote server rejects the request — usually as a 4xx that the gateway classifies
|
||
as permanent, which is correct but obscures the root cause.
|
||
|
||
**Recommendation**
|
||
|
||
Validate `method.HttpMethod` at gateway entry — either with a small `switch` of
|
||
allowed verbs in `InvokeHttpAsync` that throws `PermanentExternalSystemException` for
|
||
an unsupported verb (cheap, immediate, surfaces a clear error to the script), or by
|
||
adding a validation pass in the Template/Deployment Manager so it can never reach
|
||
the gateway. The first option is local to this module and cheaper to land. Either
|
||
way, the canonical list should agree with `BuildUrl`'s query-vs-body decision (which
|
||
currently knows about POST/PUT/PATCH for body and GET/DELETE for query — note PATCH
|
||
is in the body branch but not the design-doc list; see finding 023).
|
||
|
||
### ExternalSystemGateway-023 — PATCH HTTP method is supported by code but absent from the design doc; body-vs-query decision drifts from the documented set
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:241`, `docs/requirements/Component-ExternalSystemGateway.md:43` |
|
||
|
||
**Resolution (2026-05-28):** Doc-only fix; confirmed PATCH is wired in `ExternalSystemClient.cs:258-260` alongside POST/PUT for body serialization. Added `PATCH` to the design doc's HTTP-method list (line 42) and updated the body/query-parameter sentence (line 75) so the documented set matches the code's `body = POST/PUT/PATCH; query = GET/DELETE` split.
|
||
|
||
**Description**
|
||
|
||
The component design doc lists the supported HTTP methods as `GET, POST, PUT, or
|
||
DELETE` (line 43: `**HTTP method**: GET, POST, PUT, or DELETE.`). `InvokeHttpAsync`'s
|
||
body-serialization branch at lines 239–250 explicitly includes `PATCH` alongside POST
|
||
and PUT — so PATCH is in fact supported (and routes parameters into the JSON body),
|
||
but operators reading the spec would not know it. Conversely, `BuildUrl`'s
|
||
query-string branch at lines 364–366 lists only `GET` and `DELETE`, so a PATCH
|
||
method's parameters always go to the body, matching the body-branch but not appearing
|
||
anywhere in the documented contract.
|
||
|
||
This is mild drift — the code is more permissive than the spec. It only becomes a
|
||
real issue if a future change relies on the documented "only GET/POST/PUT/DELETE"
|
||
set and breaks the PATCH path silently, or if PATCH is genuinely out of scope and a
|
||
template author defines a PATCH method on purpose only to learn later it is
|
||
unsupported.
|
||
|
||
**Recommendation**
|
||
|
||
Pick one direction and apply it in the same session, per the project's "design doc +
|
||
code travel together" rule:
|
||
|
||
- If PATCH is intentionally supported, add `PATCH` to the Component-ExternalSystemGateway.md
|
||
HTTP-method list (line 43) and add a parameterised test confirming a PATCH method
|
||
sends its parameters in the JSON body and resolves like POST/PUT for error
|
||
classification.
|
||
- If PATCH is not in scope, remove `method.HttpMethod.Equals("PATCH", ...)` from the
|
||
body branch in `InvokeHttpAsync` and let finding-022's verb validation reject it.
|
||
The design-doc list then remains the single source of truth.
|