4 Commits

Author SHA1 Message Date
Joseph Doherty 0ba4e49e11 docs(code-reviews): re-review batch 4 at 39d737e — SiteEventLogging, SiteRuntime, StoreAndForward, TemplateEngine
11 new findings: SiteEventLogging-012..014, SiteRuntime-017..019, StoreAndForward-015..017, TemplateEngine-015..016.
2026-05-17 00:51:58 -04:00
Joseph Doherty 3b3760f026 docs(code-reviews): re-review batch 3 at 39d737e — Host, InboundAPI, ManagementService, NotificationService, Security
21 new findings: Host-012..015, InboundAPI-014..017, ManagementService-014..017, NotificationService-014..018, Security-012..015.
2026-05-17 00:48:25 -04:00
Joseph Doherty 89636e2bbf docs(code-reviews): re-review batch 2 at 39d737e — ConfigurationDatabase, DataConnectionLayer, DeploymentManager, ExternalSystemGateway, HealthMonitoring
17 new findings: ConfigurationDatabase-012..014, DataConnectionLayer-014..017, DeploymentManager-015..017, ExternalSystemGateway-015..017, HealthMonitoring-013..016.
2026-05-17 00:45:10 -04:00
Joseph Doherty e49846603e docs(code-reviews): re-review batch 1 at 39d737e — CentralUI, CLI, ClusterInfrastructure, Commons, Communication
17 new findings: CentralUI-020..025, CLI-014..016, ClusterInfrastructure-009..010, Commons-013..014, Communication-012..015.
2026-05-17 00:41:21 -04:00
20 changed files with 3204 additions and 165 deletions
+154 -3
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.CLI` |
| Design doc | `docs/requirements/Component-CLI.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 3 |
## Summary
@@ -31,8 +31,26 @@ ID-keyed, flag-based surface. Test coverage exercises `OutputFormatter`, `CliCon
`CommandHelpers.HandleResponse`, but the HTTP client, the `debug stream` path, the JSON
argument parsing, and the command-tree wiring are untested.
#### Re-review 2026-05-17 (commit `39d737e`)
All 13 prior findings are confirmed resolved — the resilience gaps, the dead format
configuration, the credential-handling weakness, and the test-coverage holes have all
been closed, and the test suite has grown substantially (`CommandTreeTests`,
`ManagementHttpClientTests`, `DebugStreamTests`, etc.). The CLI's runtime behaviour is now
solid. This re-review walked all 14 command groups against the full checklist and found
three new issues, all rooted in **update-command and design-document drift** rather than
runtime defects: every `update` command requires the entity's "core" fields (`--name`,
`--script`) even though the design doc presents them as optional, so a partial update is
impossible (CLI-014); the design doc's command surface has drifted again in two specific
places — `template composition delete` and the `data-connection` config flags (CLI-015);
and `WriteAsTable` derives table columns from only the first array element, silently
dropping columns for any later element with a different shape (CLI-016). No
Critical/High issues; the module remains healthy.
## Checklist coverage
_Original review (2026-05-16, `9c60592`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Format precedence is broken (CLI-001); empty/non-JSON success bodies crash table rendering (CLI-002, CLI-003). |
@@ -46,6 +64,21 @@ argument parsing, and the command-tree wiring are untested.
| 9 | Testing coverage | ☑ | No tests for `ManagementHttpClient`, `DebugCommands`, command-tree wiring, or JSON argument parsing (CLI-013). |
| 10 | Documentation & comments | ☑ | `Component-CLI.md` mismatch (CLI-007); the in-repo `README.md` is reasonably accurate. Minor exit-code doc mismatch (CLI-009). |
_Re-review (2026-05-17, `39d737e`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Update commands require "core" fields, blocking partial updates (CLI-014); `WriteAsTable` headers derived from first array element only (CLI-016). |
| 2 | Akka.NET conventions | ☑ | Not applicable — pure HTTP/SignalR client. No issues. |
| 3 | Concurrency & thread safety | ☑ | `debug stream` concurrency now resolved via `DebugStreamHelpers` (CLI-011/012). No new issues. |
| 4 | Error handling & resilience | ☑ | Malformed-URL / malformed-JSON / connect-cancellation paths all hardened (CLI-004/005/010). No new issues. |
| 5 | Security | ☑ | Env-var credential fallback in place (CLI-006). Basic Auth over HTTP is by design. No new issues. |
| 6 | Performance & resource management | ☑ | `CancellationTokenSource` now `using`-scoped (CLI-011). No new issues. |
| 7 | Design-document adherence | ☑ | Two residual command-surface drifts: `template composition delete` and `data-connection --primary-config` (CLI-015). |
| 8 | Code organization & conventions | ☑ | Consistent and clean; option construction centralised in `CliOptions`. No new issues. |
| 9 | Testing coverage | ☑ | Substantially expanded (`CommandTreeTests`, `ManagementHttpClientTests`, `DebugStreamTests`). No new gaps. |
| 10 | Documentation & comments | ☑ | XML docs accurate. `Component-CLI.md` drift folded into CLI-015. |
## Findings
### CLI-001 — `SCADALINK_FORMAT` env var and config-file format are dead; format precedence broken
@@ -537,3 +570,121 @@ Resolved 2026-05-16 (commit pending). Coverage gaps confirmed and closed:
`InstanceCommands.TryParseBindings`/`TryParseOverrides` and covered by
`InstanceArgumentParsingTests` under CLI-005.
The CLI test suite went from 42 to 77 passing tests.
### CLI-014 — `update` commands require "core" fields, making partial updates impossible
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/TemplateCommands.cs:77`, `src/ScadaLink.CLI/Commands/SiteCommands.cs:86`, `src/ScadaLink.CLI/Commands/ExternalSystemCommands.cs:40-42`, `src/ScadaLink.CLI/Commands/DataConnectionCommands.cs:39-40`, `src/ScadaLink.CLI/Commands/NotificationCommands.cs:40-41`, `src/ScadaLink.CLI/Commands/ApiMethodCommands.cs:79` |
**Description**
The design doc presents `update` commands with all non-`--id` fields as optional, e.g.
`template update --id <id> [--name <name>] [--description <desc>] [--parent-id <id>]`
(`Component-CLI.md:62`) and `api-method update --id <id> [--script <code>] ...`
(`Component-CLI.md:224`). The implementation contradicts this: the update commands mark
the entity's "core" fields as `Required = true`, so the user must always re-supply them:
- `template update``--name` is `Required = true` (`TemplateCommands.cs:77`).
- `site update``--name` is `Required = true` (`SiteCommands.cs:86`).
- `external-system update``--name`, `--endpoint-url`, `--auth-type` are all
`Required = true` (`ExternalSystemCommands.cs:40-42`).
- `data-connection update``--name`, `--protocol` are `Required = true`
(`DataConnectionCommands.cs:39-40`).
- `notification update``--name`, `--emails` are `Required = true`
(`NotificationCommands.cs:40-41`).
- `api-method update``--script` is `Required = true` (`ApiMethodCommands.cs:79`).
- The same pattern applies to `template attribute/alarm/script update`
(`TemplateCommands.cs:164-165`, `246-248`, `332-334`) and `role-mapping update`
(`SecurityCommands.cs:110-111`).
Because the corresponding `Update*Command` records are whole-replace (they carry the full
field set, not a sparse patch), a user who wants to change only one field — e.g. flip an
API method's timeout, or rename a template — must look up and re-pass every other field's
current value. Omitting any required flag is a hard parse error. This makes scripted,
single-field updates (a core CLI/CI use case) awkward and error-prone, and it does not
match the documented optional-flag surface.
**Recommendation**
Decide on one model and align doc + code. Either (a) make the update flags genuinely
optional and have the server/`Update*Command` treat a null field as "leave unchanged"
(sparse patch), or (b) if whole-replace is intentional, update `Component-CLI.md` to show
these flags as required (no `[...]`) and document that an update replaces the whole
entity. Option (a) matches the documented surface and the typical CLI expectation.
**Resolution**
_Unresolved._
### CLI-015 — `Component-CLI.md` command surface has drifted again in two places
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `docs/requirements/Component-CLI.md:75`, `docs/requirements/Component-CLI.md:125-126` (vs. `src/ScadaLink.CLI/Commands/TemplateCommands.cs:404-413`, `src/ScadaLink.CLI/Commands/DataConnectionCommands.cs:41`, `:86`) |
**Description**
CLI-007 regenerated the doc's "Command Structure" section, but two specific drifts remain
or were introduced:
- `Component-CLI.md:75` documents `template composition delete --template-id <id>
--instance-name <name>`, but the implementation (`TemplateCommands.cs:404-413`) deletes
a composition by its own integer ID via a single `--id` option
(`DeleteTemplateCompositionCommand(id)`). The doc's two-flag form does not exist.
- `data-connection create` and `update` accept a `--primary-config` option (aliased
`--configuration`) for the primary configuration JSON (`DataConnectionCommands.cs:86`,
`:41`), but `Component-CLI.md:125-126` lists only `--backup-config` and
`--failover-retry-count` — the primary-config flag is absent from the doc.
A reader following the doc would use a non-existent `template composition delete` form and
would not discover the `--primary-config` flag.
**Recommendation**
Correct `Component-CLI.md:75` to `template composition delete --id <id>`, and add
`[--primary-config <json>]` to the documented `data-connection create`/`update` signatures
(`Component-CLI.md:125-126`). Also note the `--configuration` alias if aliases are
documented elsewhere.
**Resolution**
_Unresolved._
### CLI-016 — `WriteAsTable` derives columns from the first array element only
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/CommandHelpers.cs:184-200` |
**Description**
When rendering a JSON array as a table, `WriteAsTable` builds the header set from
`items[0].EnumerateObject()` only (`CommandHelpers.cs:184-186`) and then projects every
row against that fixed header list (`:188-198`). If a later element of the array has a
different shape — additional properties, or properties the first element lacks — those
extra columns are silently dropped from the table and a row missing a header property
renders an empty cell. The user sees a table that appears complete but has omitted data,
with no indication that columns were discarded. (The JSON output path is unaffected; this
only affects `--format table`.) Management API list responses are generally homogeneous,
so the practical impact is low, but a heterogeneous array — e.g. a diff or a mixed-status
list — would be rendered incorrectly with no warning.
**Recommendation**
Compute the header set as the union of property names across all array elements (iterate
all items, collect distinct property names preserving first-seen order) before projecting
rows, so no element's data is silently dropped.
**Resolution**
_Unresolved._
+258 -3
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.CentralUI` |
| Design doc | `docs/requirements/Component-CentralUI.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 6 |
## Summary
@@ -32,6 +32,24 @@ Testing coverage is thin for a module this large: only the script analyzer,
TreeView, schema model, and a few data-connection pages have unit tests; most
pages and the auth bridge are untested.
#### Re-review 2026-05-17 (commit `39d737e`)
All 19 findings from the 2026-05-16 review are confirmed closed. The resolution
batch (`a9bd7ee`..`34588ae`) substantially rewrote the auth bridge, the script
sandbox, several Deployment/Monitoring pages, and the shared component disposal
paths, so this re-review re-examined the post-fix state across all 10 checklist
categories. Six new findings (CentralUI-020 .. 025) were recorded. The most
important is **CentralUI-020**: the two prior fixes interact destructively — the
CentralUI-004 fix made `CookieAuthenticationStateProvider` return a frozen,
constructor-time auth-state snapshot, while the CentralUI-005 fix rewrote
`SessionExpiry.razor` to *poll* that same provider to detect a lapsed session.
Because the snapshot never changes for the life of the circuit, the idle-timeout
redirect can never fire, so the documented idle-logout behaviour is silently
defeated. The remaining new findings are a cross-thread `Dictionary` mutation in
`DebugView`, an unguarded `InvokeAsync` in the new `Deployments` push handler,
and three Low-severity items (residual bare `catch`, magic-string claim
lookups, and the untested `SessionExpiry` polling path).
## Checklist coverage
| # | Category | Examined | Notes |
@@ -47,6 +65,14 @@ pages and the auth bridge are untested.
| 9 | Testing coverage | ☑ | Auth, sandbox-run, DebugView, Health, ParkedMessages, most pages untested. |
| 10 | Documentation & comments | ☑ | Comments are accurate and helpful; a few stale claims noted. |
Re-review 2026-05-17 (`39d737e`): all 10 categories re-examined against the
post-fix source. New findings — category 3 (CentralUI-020 the auth-snapshot vs
session-poll interaction is also a design-adherence regression; CentralUI-021
cross-thread `Dictionary`; CentralUI-022 unguarded `InvokeAsync`), category 4
(CentralUI-023 residual bare `catch`), category 8 (CentralUI-024 magic-string
claims), category 9 (CentralUI-025 untested `SessionExpiry` poll). Categories
1, 2, 5, 6, 7, 10 produced no new findings.
## Findings
### CentralUI-001 — Test Run sandbox executes arbitrary C# with no trust-model enforcement
@@ -961,3 +987,232 @@ it was failing on the baseline because `DeploymentService` had gained a
`DiffService` constructor dependency from a DeploymentManager contract change
that the test fixture had not been updated for; `DiffService` is now registered
in the fixture.)
### CentralUI-020 — Idle-session redirect never fires: `SessionExpiry` polls a frozen auth-state snapshot
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Shared/SessionExpiry.razor:39-62`; `src/ScadaLink.CentralUI/Auth/CookieAuthenticationStateProvider.cs:29-43` |
**Description**
The CentralUI-004 fix and the CentralUI-005 fix interact destructively.
CentralUI-004 made `CookieAuthenticationStateProvider` snapshot the principal
**once** in its constructor into a cached `Task<AuthenticationState>` and serve
that exact task for the entire life of the SignalR circuit — it never re-reads
`HttpContext`, never calls `SetAuthenticationState`, and never raises
`NotifyAuthenticationStateChanged`. CentralUI-005 then rewrote
`SessionExpiry.razor` to *poll* `AuthStateProvider.GetAuthenticationStateAsync()`
once a minute and redirect to `/login` "once the sliding cookie has actually
lapsed server-side." But `GetAuthenticationStateAsync()` returns the same frozen
constructor-time snapshot on every call — `auth.User.Identity.IsAuthenticated`
is permanently `true` for the life of the circuit regardless of whether the
server-side cookie has expired. The poll loop therefore never observes an
expired session and the redirect never fires. An idle user whose cookie has
lapsed server-side keeps an authenticated-looking page open indefinitely; the
documented "30-minute idle timeout" is silently defeated for any user who
leaves a circuit open. (The cookie middleware would still reject the *next*
full HTTP request / new circuit, so this is a stale-UI / missed-logout exposure
rather than a full auth bypass — but the page continues to render
authenticated content and a SignalR circuit can stay alive for a long time.)
This is also a design-document-adherence regression against CLAUDE.md
"Security & Auth" (30-minute idle timeout) — recorded under Concurrency because
the root cause is the lifetime/staleness mismatch between the two components.
**Recommendation**
`SessionExpiry` must consult something that actually reflects the live
server-side session, not the circuit's frozen principal. Options: (a) have
`SessionExpiry` poll a lightweight authenticated server endpoint (e.g. a
`/auth/ping` minimal API that returns 401 once the cookie has lapsed) and
redirect on 401; or (b) give `CookieAuthenticationStateProvider` a refresh path
that re-validates the cookie and calls `SetAuthenticationState` /
`NotifyAuthenticationStateChanged` so the polled state can actually change.
Whichever is chosen, add a test that exercises the redirect path with an
expired session (see CentralUI-025).
**Resolution**
_Unresolved._
### CentralUI-021 — `DebugView` stream callback mutates `Dictionary` off the render thread
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Pages/Deployment/DebugView.razor:404-419,511-519,275-289` |
**Description**
The `onEvent` callback passed to `DebugStreamService.StartStreamAsync` runs on
an Akka/gRPC thread (as the design doc and the CentralUI-009 comments state). It
calls `UpsertWithCap(_attributeValues, …)` / `UpsertWithCap(_alarmStates, …)`
**directly on that thread** — the mutation is not marshalled through
`InvokeAsync`; only the subsequent `StateHasChanged` is. Meanwhile the render
thread evaluates `FilteredAttributeValues` / `FilteredAlarmStates`, which
enumerate `_attributeValues.Values` / `_alarmStates.Values` and call
`OrderBy(...).ToList()`. `Dictionary<TKey,TValue>` is not thread-safe: a write
on the Akka thread concurrent with an enumeration on the render thread can throw
`InvalidOperationException` ("Collection was modified; enumeration operation may
not execute") or corrupt the dictionary's internal buckets. The CentralUI-009
fix added a `_disposed` guard but did not address this data race — the guard
only prevents touching a *disposed* component, not concurrent access to a live
one. Under a busy debug stream this will intermittently fault the page.
**Recommendation**
Marshal the dictionary mutation onto the render thread too — move the
`UpsertWithCap` call inside the `SafeInvokeAsync`/`InvokeAsync` body so all
access to `_attributeValues`/`_alarmStates` happens on the renderer's
dispatcher. Alternatively guard both the writes and the `Filtered*` reads with a
lock, or use a concurrent collection. The cap-trim loop must be inside the same
critical section as the upsert.
**Resolution**
_Unresolved._
### CentralUI-022 — `Deployments` push handler fires `InvokeAsync` with no disposal guard
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Pages/Deployment/Deployments.razor:221-229,317-322` |
**Description**
`OnDeploymentStatusChanged` is invoked by `IDeploymentStatusNotifier`, a process
singleton, on the DeploymentManager service thread. The handler does
`_ = InvokeAsync(async () => { await LoadDataAsync(); StateHasChanged(); })`,
discarding the returned task. `Dispose()` unsubscribes the handler, but there is
a race window: the notifier can read the subscriber list and begin invoking
`OnDeploymentStatusChanged` *just before* the component is disposed, so
`InvokeAsync` then runs against a disposed component and throws
`ObjectDisposedException` on the DeploymentManager thread — an unobserved task
exception (the task is fire-and-forget). The same hazard was explicitly fixed
for `DebugView` (CentralUI-009, `SafeInvokeAsync` + `_disposed` flag) and
`ToastNotification` (CentralUI-010), but the new push-based `Deployments`
handler introduced by the CentralUI-006 fix did not adopt the same guard.
Separately, every push event triggers two full repository reloads
(`GetAllInstancesAsync` + `GetAllDeploymentRecordsAsync`) for every open
circuit, so a burst of status changes amplifies into N×2 round-trips per tick.
**Recommendation**
Add a `volatile bool _disposed` set first in `Dispose()`, have
`OnDeploymentStatusChanged` no-op when set, and wrap the `InvokeAsync` dispatch
to swallow `ObjectDisposedException` (mirror `DebugView.SafeInvokeAsync`).
Optionally coalesce bursts (debounce) and/or reload only the changed record
rather than the whole table on each event.
**Resolution**
_Unresolved._
### CentralUI-023 — Residual bare `catch {}` blocks swallow JS interop errors
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Pages/Monitoring/ParkedMessages.razor:690-698`; `src/ScadaLink.CentralUI/Components/Shared/DiffDialog.razor:107-116,118-130,104` |
**Description**
CentralUI-018 narrowed the bare `catch {}` blocks in `MonacoEditor`,
`TreeView`, and `Sites.razor`, but the same pattern survives elsewhere.
`ParkedMessages.CopyAsync` wraps `navigator.clipboard.writeText` in
`catch { _toast.ShowError("Copy failed."); }` — a real `JSException`
(clipboard permission denied) and an expected `JSDisconnectedException` are
treated identically and neither is logged. `DiffDialog.TryLockBodyAsync` /
`TryUnlockBodyAsync` each have a bare outer `catch` whose handler does another
JS call wrapped in a second bare `catch { /* swallow */ }`, and
`OnAfterRenderAsync`'s `_modalRef.FocusAsync()` is wrapped in a bare
`catch { /* prerender or detached: ignore */ }`. Genuine interop failures in
these paths are invisible in production logs.
**Recommendation**
Catch `JSDisconnectedException` silently and `JSException` (and
`InvalidOperationException` for the prerender focus case) with an `ILogger`
call, consistent with the CentralUI-018 fixes in the same module.
**Resolution**
_Unresolved._
### CentralUI-024 — Claim lookups use magic strings instead of `JwtTokenService` constants
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Layout/NavMenu.razor:102`; `src/ScadaLink.CentralUI/Components/Pages/Dashboard.razor:14`; `GetCurrentUserAsync` in `Templates.razor`, `TemplateEdit.razor`, `TemplateCreate.razor`, `SharedScripts.razor`, `SharedScriptForm.razor`, `Sites.razor`, `Topology.razor`, `InstanceCreate.razor`, `InstanceConfigure.razor` |
**Description**
`ScadaLink.Security.JwtTokenService` exposes the canonical claim-type constants
(`UsernameClaimType = "Username"`, `DisplayNameClaimType = "DisplayName"`,
`RoleClaimType`, `SiteIdClaimType`). `SiteScopeService` correctly uses
`JwtTokenService.SiteIdClaimType`, but every `GetCurrentUserAsync` helper across
ten pages does `authState.User.FindFirst("Username")?.Value`, and `NavMenu` /
`Dashboard` do `context.User.FindFirst("DisplayName")`. The literals happen to
match the constants today, so there is no live bug — but if the claim type is
ever renamed in `JwtTokenService` (the single source of truth) every one of
these call sites silently breaks, falling back to `"unknown"` for the audit
user and a blank display name. The duplicated `GetCurrentUserAsync` helper is
also copy-pasted verbatim into ten components.
**Recommendation**
Replace the string literals with `JwtTokenService.UsernameClaimType` /
`DisplayNameClaimType`. Consider extracting the repeated `GetCurrentUserAsync`
into a single shared helper (e.g. an extension on `AuthenticationStateProvider`
or a small scoped service) so the claim lookup lives in exactly one place.
**Resolution**
_Unresolved._
### CentralUI-025 — `SessionExpiry` polling/redirect path has no test coverage
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.CentralUI.Tests/Auth/SessionExpiryPolicyTests.cs`; `src/ScadaLink.CentralUI/Components/Shared/SessionExpiry.razor` |
**Description**
`SessionExpiryPolicyTests` covers only `AuthEndpoints.BuildSignInProperties()`
(the sign-in properties shape). The actual runtime behaviour of
`SessionExpiry.razor` — that an expired session triggers a redirect to
`/login`, that an authenticated session does not, and that the component does
not poll/redirect on the `/login` page itself — is untested. Had a behavioural
test exercised the redirect with an expired/anonymous auth state against the
real `CookieAuthenticationStateProvider`, the CentralUI-020 defect (the frozen
snapshot never reporting an expired session) would have been caught. The
component is the system's only client-side idle-logout mechanism, so the gap is
material.
**Recommendation**
Add bUnit tests for `SessionExpiry`: (a) with an unauthenticated auth state the
component navigates to `/login`; (b) with an authenticated state it does not;
(c) on the `/login` route it neither polls nor redirects. The provider used in
the test must be one whose state can actually transition to expired — which
also forces the CentralUI-020 fix.
**Resolution**
_Unresolved._
+126 -13
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.ClusterInfrastructure` |
| Design doc | `docs/requirements/Component-ClusterInfrastructure.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 2 |
## Summary
@@ -29,20 +29,39 @@ every other component runs on, yet it presently delivers nothing the design requ
The single options class is clean and its test covers defaults and setters
adequately for what exists.
**Re-review 2026-05-17 (commit `39d737e`).** All eight prior findings (CI-001..008)
were resolved by the batch of work between `9c60592` and `39d737e`: `ClusterOptions`
gained XML docs, the `SectionName` constant, and the `DownIfAlone` property;
`ClusterOptionsValidator` was added; `ServiceCollectionExtensions` now registers the
validator and throws from the dead actor-registration method; and the test project
grew to 16 cases across three test classes. The module is in good shape — the
`ClusterOptions` contract, its validator, and the DI registration are all sound,
well-documented, and well-tested. This re-review examined all three source files and
all three test files against the full 10-category checklist and found **two new
issues**, both stemming from work the prior review explicitly deferred to a "Host
review" that has not happened: the `DownIfAlone` property is exposed and validated as
part of the configuration contract but is never consumed — `ScadaLink.Host`'s
`BuildHocon` still hard-codes `down-if-alone = on` (CI-009, Medium) — and the validator
does not enforce the design doc's requirement that `down-if-alone` be `on` for the
keep-oldest resolver, so `DownIfAlone = false` is silently accepted (CI-010, Low).
## Checklist coverage
Original review (2026-05-16, `9c60592`) below; the re-review notes (2026-05-17,
`39d737e`) are appended in each row.
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | No executable logic exists beyond an options POCO; no logic bugs, but `ServiceCollectionExtensions` returns success while doing nothing (CI-002). |
| 2 | Akka.NET conventions | ✓ | No actors, no `ActorSystem` bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against `Tell`/`Ask`, immutability, or `PipeTo`. |
| 3 | Concurrency & thread safety | ✓ | No shared mutable state, no actors, no async code. No issues found in current code. |
| 4 | Error handling & resilience | ✓ | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. |
| 5 | Security | ✓ | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. |
| 6 | Performance & resource management | ✓ | No streams, connections, timers, or `IDisposable` resources exist yet. No issues found in current code. |
| 7 | Design-document adherence | ✓ | Severe drift: the module implements none of its documented responsibilities (CI-001). `ClusterOptions` also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and `down-if-alone` (CI-003). |
| 8 | Code organization & conventions | ✓ | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing `IValidateOptions`/data-annotation validation (CI-004) versus the Options pattern intent. |
| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). |
| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). |
| 1 | Correctness & logic bugs | ✓ | No executable logic exists beyond an options POCO; no logic bugs, but `ServiceCollectionExtensions` returns success while doing nothing (CI-002). **Re-review:** CI-002 resolved. New — `DownIfAlone` is a settable property that controls nothing because the HOCON builder hard-codes the value (CI-009). |
| 2 | Akka.NET conventions | ✓ | No actors, no `ActorSystem` bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against `Tell`/`Ask`, immutability, or `PipeTo`. **Re-review:** confirmed the Akka bootstrap legitimately lives in `ScadaLink.Host` (CI-001 resolution); still nothing actor-related in this module. No issues. |
| 3 | Concurrency & thread safety | ✓ | No shared mutable state, no actors, no async code. No issues found in current code. **Re-review:** validator and DI extensions are stateless; no issues. |
| 4 | Error handling & resilience | ✓ | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. **Re-review:** the validator now fails fast on misconfiguration. New — it does not enforce the design doc's `down-if-alone = on` requirement (CI-010). |
| 5 | Security | ✓ | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. **Re-review:** still no authn/authz surface, no secret handling. No issues. |
| 6 | Performance & resource management | ✓ | No streams, connections, timers, or `IDisposable` resources exist yet. No issues found in current code. **Re-review:** no resources held; the validator allocates a small failure list per call only. No issues. |
| 7 | Design-document adherence | ✓ | Severe drift: the module implements none of its documented responsibilities (CI-001). `ClusterOptions` also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and `down-if-alone` (CI-003). **Re-review:** CI-001/CI-003 resolved (ownership split documented; `DownIfAlone` added). New — `DownIfAlone` was added to the contract but never wired into the HOCON (CI-009). |
| 8 | Code organization & conventions | ✓ | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing `IValidateOptions`/data-annotation validation (CI-004) versus the Options pattern intent. **Re-review:** CI-004/CI-005 resolved; `SectionName` constant present and options/validator placement correct. No issues. |
| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). **Re-review:** CI-006 resolved — 16 tests across three classes covering options, validator, and DI registration. No `DownIfAlone`-wiring test exists, but that wiring lives in the Host (CI-009). No new issue here. |
| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). **Re-review:** CI-007/CI-008 resolved — full XML docs on all members; skeleton comments gone. Note: the `DownIfAlone` XML doc calls `true` "the design-doc requirement" yet the value is inert (CI-009) and unenforced (CI-010). |
## Findings
@@ -490,3 +509,97 @@ section, and the component table reflects the true placement. This is a
documentation-only finding, so no runtime regression test is meaningful; verified by
inspection of `ServiceCollectionExtensions.cs` and
`docs/requirements/Component-ClusterInfrastructure.md:21-39`.
### ClusterInfrastructure-009 — `DownIfAlone` is an inert configuration knob — never consumed by the HOCON builder
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:74` |
**Description**
The `DownIfAlone` property was added to `ClusterOptions` by CI-003's resolution as
part of "the split-brain configuration contract". It is public, defaults to `true`,
carries an XML doc presenting it as "the design-doc requirement", and is exercised by
`ClusterOptionsTests.DownIfAlone_CanBeSet`. However, nothing in the system reads it.
The Akka.NET HOCON is generated by `ScadaLink.Host.Actors.AkkaHostedService.BuildHocon`,
which **hard-codes** the resolver setting:
```
split-brain-resolver {
active-strategy = ...
stable-after = ...
keep-oldest {
down-if-alone = on
}
}
```
`BuildHocon` receives the full `ClusterOptions` instance and consumes every other
field (`SeedNodes`, `MinNrOfMembers`, `SplitBrainResolverStrategy`, `StableAfter`,
`HeartbeatInterval`, `FailureDetectionThreshold`) but ignores `DownIfAlone` entirely.
The result is a configuration property that an operator can set in `appsettings.json`,
that passes validation, and that has **zero runtime effect** — setting
`DownIfAlone: false` does not turn the flag off. CI-003's resolution explicitly
acknowledged this gap ("wiring it to read `DownIfAlone` is a one-line `ScadaLink.Host`
change ... noted for the Host's review") but the wiring was never done and no tracked
finding carried it, so the gap has silently persisted to commit `39d737e`. An inert,
misleadingly-documented configuration knob is a correctness and design-adherence
defect: it gives operators a false sense of control over a safety-critical resolver
setting.
**Recommendation**
Either (a) wire `DownIfAlone` into `BuildHocon` — emit `down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}`
— so the property does what its XML doc claims (a Host-side change, to be tracked in
the Host module's review since `BuildHocon` lives there), or (b) if the flag is
intentionally fixed at `on` and must never be operator-configurable, remove the
`DownIfAlone` property from `ClusterOptions` and document the hard-coded `on` value as
a non-negotiable invariant. Do not leave a public, settable, validated property that
controls nothing.
**Resolution**
_Unresolved._
### ClusterInfrastructure-010 — Validator does not enforce `DownIfAlone = true` despite the design doc requiring it
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptionsValidator.cs:21-71` |
**Description**
`Component-ClusterInfrastructure.md` (Split-Brain Resolution) states the keep-oldest
resolver must be configured with `down-if-alone = on`, and the XML doc on
`ClusterOptions.DownIfAlone` calls `true` "the design-doc requirement" — the rationale
being that without it the oldest node can run as an isolated single-node cluster
during a partition while the younger node forms its own. `ClusterOptionsValidator`
guards every other safety-critical setting (`MinNrOfMembers == 1`, `keep-oldest`-only
strategy, positive timings, heartbeat below the failure threshold) but performs no
check on `DownIfAlone`. A configuration of `DownIfAlone: false` therefore passes
validation cleanly. This is currently latent because CI-009 shows the property is not
consumed at all, but the moment CI-009 is fixed by wiring the property into the HOCON
(option (a)), `DownIfAlone: false` would silently produce the unsafe single-node
behaviour the design doc explicitly forbids — with no fail-fast guard. The validator
is the right place to enforce the invariant, consistent with how it already rejects
quorum split-brain strategies.
**Recommendation**
If CI-009 is resolved by keeping `DownIfAlone` configurable, add a check to
`ClusterOptionsValidator.Validate` that fails when `DownIfAlone` is `false` (or, if
some future deployment legitimately needs it off, fails only in combination with the
`keep-oldest` strategy), with a message explaining the isolated-single-node-cluster
hazard. If CI-009 is resolved by removing the property, this finding is moot and
should be closed as resolved alongside it.
**Resolution**
_Unresolved._
+91 -5
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.Commons` |
| Design doc | `docs/requirements/Component-Commons.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 2 |
## Summary
@@ -32,14 +32,28 @@ kind of edge-case logic that warrants them. Entity and message contracts otherwi
clean and additive-evolution-friendly, with the exception of one `ValueTuple` use in a
wire command.
**Re-review 2026-05-17 (commit `39d737e`).** All twelve prior findings (Commons-001
through Commons-012) are confirmed `Resolved` — the fixes are sound, well-targeted, and
backed by focused regression tests (`StaleTagMonitorRaceTests`, `DynamicJsonElementTests`,
`ScriptParametersTests`, `ManagementCommandRegistryTests`, `OpcUaEndpointConfigSerializerTests`,
`ResultTests`, `ValueFormatterTests`, `ConnectionBindingSerializationTests`,
`FlatteningAndScriptScopeTests`). The new files introduced since `9c60592`
(`TemplateAlarm` lock/inherit fields, `IExternalSystemRepository` name-keyed lookups,
`DeploymentStateQueryRequest`/`Response`, `ParameterDefinition`) follow the established
POCO / record / additive-evolution conventions and carry round-trip compatibility tests.
Two new Low-severity findings were recorded this pass: a `DynamicJsonElement` array
indexer that rejects `long` indices (Commons-013) and an `OpcUaEndpointConfigSerializer`
legacy-fallback path that can mislabel a corrupt new-shape row as `Legacy` (Commons-014).
No Critical, High, or Medium issues were found.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | `DynamicJsonElement.TryConvert` returns success for non-convertible types; `Result<T>` allows null error; legacy-config fallback loses data. |
| 1 | Correctness & logic bugs | ✓ | `DynamicJsonElement.TryConvert` returns success for non-convertible types; `Result<T>` allows null error; legacy-config fallback loses data. Re-review: `DynamicJsonElement.TryGetIndex` rejects non-`int` indices (Commons-013). |
| 2 | Akka.NET conventions | ✓ | Commons has no actors (correct). Message contracts are records and immutable. One wire message uses `ValueTuple` (Commons-008). Correlation IDs present on request/response messages. |
| 3 | Concurrency & thread safety | ✓ | `StaleTagMonitor` has a check-then-act race between the timer callback and `OnValueReceived` (Commons-001). |
| 4 | Error handling & resilience | ✓ | `ScriptParameters.GetNullable` silently swallows conversion failures (Commons-003); OPC UA legacy deserialize discards malformed input (Commons-005). |
| 4 | Error handling & resilience | ✓ | `ScriptParameters.GetNullable` silently swallows conversion failures (Commons-003); OPC UA legacy deserialize discards malformed input (Commons-005). Re-review: corrupt typed OPC UA rows can fall through to the legacy path and be mislabelled `Legacy` (Commons-014). |
| 5 | Security | ✓ | No auth logic here. `SmtpConfiguration.Credentials` / OPC UA passwords are plain-string fields (storage/encryption is a consumer concern) — noted, not a finding. No script-trust violations: Commons defines no forbidden-API surface. |
| 6 | Performance & resource management | ✓ | `StaleTagMonitor` disposes its `Timer` correctly. `DynamicJsonElement` references a `JsonElement` whose backing document lifetime is not owned (Commons-002). |
| 7 | Design-document adherence | ✓ | Several behavior-bearing helper/validator/serializer classes push against REQ-COM-6 "no business logic" (Commons-007). Folder layout matches REQ-COM-5b. |
@@ -566,3 +580,75 @@ the parameterless `ToString()`). The XML doc gained a remarks block stating the
culture-invariant contract and why. Regression tests added in `ValueFormatterTests`
(`FormatDisplayValue_Double_UsesInvariantCulture_*`, `_DateTime_*`, `_CollectionOfDoubles_*`,
each pinned under `de-DE`).
### Commons-013 — `DynamicJsonElement.TryGetIndex` rejects non-`int` index values
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Commons/Types/DynamicJsonElement.cs:40-54` |
**Description**
`TryGetIndex` accepts an index only when `indexes[0] is int index`. `DynamicJsonElement`
is designed for dynamic access from scripts (`obj.items[0]`). In a `dynamic` expression the
index operand's runtime type follows the script's variable type — a script that computes
an index in a loop counter or reads it from another `DynamicJsonElement` (whose numbers
are unwrapped as `long` by `Wrap`, see `:105`) will pass a `long`, not an `int`. The
pattern match then fails, `TryGetIndex` returns `false`, and the dynamic binder throws a
`RuntimeBinderException` for what is a perfectly valid in-range index. Because the wrapper
itself surfaces JSON numbers as `long`, `obj.items[obj.count - 1]` — count being a wrapped
JSON number — is the exact failing case. The `int`-only guard also silently rejects
`byte`/`short` indices that would widen to a valid array position.
**Recommendation**
Accept any integral index by converting through `Convert.ToInt64` (guarded for
`OverflowException`) or by matching `int`, `long`, `short`, `byte` and normalizing to a
single integer before the bounds check. Add a regression test indexing with a `long`.
**Resolution**
_Unresolved._
### Commons-014 — `OpcUaEndpointConfigSerializer.Deserialize` can mislabel a corrupt typed row as `Legacy`
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Commons/Serialization/OpcUaEndpointConfigSerializer.cs:107-131` |
**Description**
`Deserialize` tries the typed path first: it parses the document, checks for an
`endpointUrl` property, then calls `JsonSerializer.Deserialize<OpcUaEndpointConfig>`.
The whole block is wrapped in `catch (JsonException) { /* fall through to legacy */ }`.
If a row *is* the current typed shape (it has `endpointUrl`) but is corrupt in a way that
makes `JsonSerializer.Deserialize` throw a `JsonException` — e.g. an enum-valued field
holding an unrecognised string, or a numeric field holding a non-numeric token — the
exception is swallowed and control falls through to `LoadLegacy`. `LoadLegacy` only
requires the root to be a JSON object, so it will usually succeed against the same input
and the result is reported as `OpcUaConfigParseStatus.Legacy`. The Commons-005 fix added
the `Malformed` status precisely so a caller can tell a recoverable legacy row from
unparseable data; this path re-introduces a softer version of the same confusion — a
genuinely broken current-shape row is presented to the user as a benign "please re-save"
legacy row, and the offending field is silently dropped by `FromFlatDict` (which ignores
keys it cannot parse) rather than surfaced. The XML doc describes the legacy fallback as
being for "pre-refactor rows" only and does not mention this branch.
**Recommendation**
Only fall through to `LoadLegacy` when the typed shape is genuinely *not present* — i.e.
the `endpointUrl` property is absent. When `endpointUrl` *is* present but typed
deserialization throws, classify the outcome as `Malformed` (or a distinct status) so the
caller can surface a real error instead of an empty/partial config. Tighten the XML doc
to describe this branch, and add a regression test for a typed row with an invalid enum
field.
**Resolution**
_Unresolved._
+178 -13
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.Communication` |
| Design doc | `docs/requirements/Component-Communication.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 4 |
## Summary
@@ -25,20 +25,37 @@ CLAUDE.md "Resume for coordinator actors" decision. Design-doc adherence is othe
good. Test coverage is broad for happy paths but has gaps around failover, cache
mutation races, and the snapshot-timeout cleanup path.
#### Re-review 2026-05-17 (commit `39d737e`)
All prior findings (Communication-001..011) are confirmed `Resolved` in this commit
and the fixes hold up against the source. The re-review walked all 10 checklist
categories again and uncovered a previously-missed defect at the centre of the gRPC
node-failover path: **`SiteStreamGrpcClientFactory.GetOrCreate` caches one client per
site identifier and silently ignores the `grpcEndpoint` argument on a cache hit**. The
`DebugStreamBridgeActor` reconnect logic flips `_useNodeA` and passes the *other*
node's endpoint, but the factory hands back the original NodeA-bound client every
time — so the documented "try the other site node endpoint" failover never actually
moves to NodeB (Communication-012). The same caching defect means a site address
change is never picked up because `RemoveSiteAsync` has no production caller
(Communication-013). Two Low findings round out the re-review: an untrusted
gRPC-supplied `correlation_id` flows straight into an Akka actor name
(Communication-014), and the factory's endpoint-reuse defect is masked by the test
mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Snapshot-timeout orphan, reconnect not calling `CleanupGrpc`, subscription-map races. |
| 2 | Akka.NET conventions | ✓ | No supervision strategy on coordinators; `Sender` captured in async-launched closure path. |
| 3 | Concurrency & thread safety | ✓ | `SiteStreamGrpcClient._subscriptions` overwrite/remove race; `_siteClients` field reassignment unused but non-readonly. |
| 4 | Error handling & resilience | ✓ | gRPC reconnect leaks server-side relay; `LoadSiteAddressesFromDb` swallows DB failures silently. |
| 5 | Security | ✓ | No findings in module code. DebugStreamHub auth lives outside this module (Central UI). |
| 6 | Performance & resource management | ✓ | Orphaned subscriptions/CTS leaks; `SiteStreamGrpcClientFactory.Dispose` blocks on async. |
| 7 | Design-document adherence | ✓ | `GrpcMaxStreamLifetime` / keepalive options defined but never applied; hard-coded values used instead. |
| 8 | Code organization & conventions | ✓ | Options pattern correct; minor: public records declared in actor files. No structural issues. |
| 9 | Testing coverage | ✓ | No tests for snapshot-timeout cleanup, address-cache refresh races, or gRPC server reconnect-leak. |
| 10 | Documentation & comments | ✓ | XML comment on `DebugStreamBridgeActor` says "Persistent actor" — it is not an Akka.Persistence actor. |
| 1 | Correctness & logic bugs | ✓ | Re-review: factory ignores endpoint on cache hit, defeating NodeA→NodeB stream failover (Communication-012). Prior items resolved. |
| 2 | Akka.NET conventions | ✓ | Coordinator `Resume` strategies now present and verified. No new issues. |
| 3 | Concurrency & thread safety | ✓ | Subscription-map register/remove now ownership-checked. `_siteClients` readonly. No new issues. |
| 4 | Error handling & resilience | ✓ | `Status.Failure` handler added; reconnect unsubscribes prior stream. No new issues. |
| 5 | Security | ✓ | Re-review: public gRPC `correlation_id` flows unvalidated into an Akka actor name (Communication-014). |
| 6 | Performance & resource management | ✓ | Synchronous `Dispose` paths fixed; CTS leaks resolved. No new issues. |
| 7 | Design-document adherence | ✓ | Re-review: site gRPC address-change disposal not wired — `RemoveSiteAsync` is dead code (Communication-013). gRPC options now applied. |
| 8 | Code organization & conventions | ✓ | Options pattern correct; public records still declared in actor files (acceptable). No structural issues. |
| 9 | Testing coverage | ✓ | Re-review: prior gaps closed, but the factory mock masks the endpoint-reuse defect — no real node-flip coverage (Communication-015). |
| 10 | Documentation & comments | ✓ | `DebugStreamBridgeActor` summary corrected. No new issues. |
## Findings
@@ -519,3 +536,151 @@ passes after):
(added with this finding's resolution).
The full module suite (`dotnet test tests/ScadaLink.Communication.Tests`) is green at
111 passing tests.
### Communication-012 — gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:39`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:166` |
**Description**
`SiteStreamGrpcClientFactory.GetOrCreate` is `_clients.GetOrAdd(siteIdentifier, _ =>
CreateClient(grpcEndpoint))` — it keys the cache by **site identifier only** and the
`grpcEndpoint` argument is used *exclusively* for the first-ever creation. Every
subsequent call for that site returns the originally-cached `SiteStreamGrpcClient`,
which is permanently bound to the `GrpcChannel` of whatever endpoint was passed first.
`DebugStreamBridgeActor` relies on the opposite behaviour. On a gRPC stream error,
`HandleGrpcError` flips `_useNodeA` and `OpenGrpcStream` recomputes
`endpoint = _useNodeA ? _grpcNodeAAddress : _grpcNodeBAddress`, then calls
`_grpcFactory.GetOrCreate(_siteIdentifier, endpoint)` expecting a client connected to
the *other* node. Because the factory ignores the new endpoint, the bridge actor
reconnects to the **same failed NodeA endpoint** on every retry. The design doc's
core debug-stream failover behaviour ("tries the other site node endpoint", "NodeB if
NodeA failed, or vice versa") is therefore inoperative — when a site node goes down,
the debug stream cannot move to the surviving node and simply exhausts `MaxRetries`
against the dead endpoint and terminates. The `_useNodeA` flip, the `previousEndpoint`
computation in `HandleGrpcError`, and the `CleanupGrpc` endpoint selection are all
dead logic. (Communication-002's `Unsubscribe`-before-reconnect fix still functions,
but it unsubscribes and re-subscribes on the *same* client/node rather than the
intended other node.)
**Recommendation**
Make the per-site client aware of both endpoints, or key the cache by
`(siteIdentifier, endpoint)`, or have `GetOrCreate` detect an endpoint change and
dispose+recreate the cached client. Given the design intent ("Falls back to NodeB if
NodeA connection fails"), the cleanest fix is to give `SiteStreamGrpcClient` (or a
per-site holder) both NodeA/NodeB addresses and let it switch channels internally,
removing the endpoint argument from `GetOrCreate` entirely. Add a test that drives a
real `SiteStreamGrpcClientFactory` through a node flip and asserts the second client
targets the other endpoint.
**Resolution**
_Unresolved._
### Communication-013 — Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:58` |
**Description**
The design doc states that `SiteStreamGrpcClientFactory` "Disposes clients on site
removal or address change." `RemoveSiteAsync` implements the disposal mechanism, but
a repo-wide search finds **no production caller** — only tests invoke it. Combined
with the cache-by-site-identifier behaviour (Communication-012), the consequence is
that once a site's `SiteStreamGrpcClient` is created, a later edit to that site's
`GrpcNodeAAddress` / `GrpcNodeBAddress` (via the Central UI or CLI) is never reflected
in the cached client — it keeps using the stale channel for the life of the process.
`CentralCommunicationActor` already refreshes the *Akka* address cache every 60s and
recreates ClusterClients on change, but there is no equivalent invalidation path
wired into the gRPC client factory. A site whose gRPC endpoints are corrected after
an initial misconfiguration will never have working debug streaming until the central
node is restarted.
**Recommendation**
Wire a site-removal / address-change signal into `SiteStreamGrpcClientFactory`
e.g. have `CentralCommunicationActor` (which already detects address changes in
`HandleSiteAddressCacheLoaded`) call `RemoveSiteAsync` for sites whose gRPC addresses
changed or were removed, or fold the gRPC endpoints into the same refresh cycle. If
the on-the-fly address-change requirement is intentionally dropped, remove
`RemoveSiteAsync` and correct the design doc.
**Resolution**
_Unresolved._
### Communication-014 — Untrusted gRPC `correlation_id` flows directly into an Akka actor name
| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:124` |
**Description**
`SubscribeInstance` is a public gRPC endpoint hosted on each site node. It creates the
relay actor with `$"stream-relay-{request.CorrelationId}-{actorSeq}"` as the actor
name, where `request.CorrelationId` comes straight off the wire. Akka actor names have
a restricted character set; a `correlation_id` containing `/`, whitespace, or other
disallowed characters makes `ActorSystem.ActorOf` throw `InvalidActorNameException`.
That exception is not caught inside `SubscribeInstance`, so it escapes as an unhandled
RPC fault (and after the `_streamSubscriber.Subscribe` / `_activeStreams` entry has
already been set up for the duration, though the `finally` does not run because the
throw is before the `try`). In practice central always supplies a GUID, so impact is
low, but the server is trusting client-supplied input to be actor-name-safe.
**Recommendation**
Validate `request.CorrelationId` on entry (non-empty, matches an expected GUID/safe
pattern) and reject with `StatusCode.InvalidArgument` otherwise; or derive the actor
name solely from the internal `_actorCounter` and keep the correlation ID only as
actor state / dictionary key.
**Resolution**
_Unresolved._
### Communication-015 — No test exercises the real gRPC client factory across a node flip
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.Communication.Tests/Grpc/DebugStreamBridgeActorTests.cs:401`, `tests/ScadaLink.Communication.Tests/Grpc/SiteStreamGrpcClientFactoryTests.cs` |
**Description**
`DebugStreamBridgeActorTests` exercises the reconnect/failover paths through
`MockSiteStreamGrpcClientFactory`, which returns one fixed mock client regardless of
the `grpcEndpoint` argument. This is exactly the behaviour the *real*
`SiteStreamGrpcClientFactory` exhibits incorrectly (Communication-012), so the mock
masks the defect: `On_GrpcError_Reconnects_To_Other_Node` passes even though the real
factory never reaches the other node. `SiteStreamGrpcClientFactoryTests` only asserts
`GetOrCreate` returns the same client for the same site — it never checks what happens
when the same site is requested with a *different* endpoint.
**Recommendation**
Add a `SiteStreamGrpcClientFactoryTests` case that calls `GetOrCreate(site, endpointA)`
then `GetOrCreate(site, endpointB)` and asserts the second call targets `endpointB`
(it should fail today and pass after Communication-012 is fixed). Have the bridge-actor
test's mock factory track the endpoint per call so node-flip coverage is meaningful.
**Resolution**
_Unresolved._
+155 -8
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.ConfigurationDatabase` |
| Design doc | `docs/requirements/Component-ConfigurationDatabase.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 3 |
## Summary
@@ -37,6 +37,28 @@ repositories (`TemplateEngineRepository`, `DeploymentManagerRepository`,
`ExternalSystemRepository`, `InboundApiRepository`, `NotificationRepository`,
`SiteRepository`, `InstanceLocator`) have little or no direct coverage.
#### Re-review 2026-05-17 (commit `39d737e`)
Re-reviewed the module at commit `39d737e`. All eleven findings from the initial
review (`9c60592`) remain `Resolved` — the secret-column encryption
(`EncryptedStringConverter` + `ApplySecretColumnEncryption`), the fail-fast no-arg
DI overload, the `AsSplitQuery` conversions, the audit cycle-safe serializer, and the
added repository test coverage all verified present and consistent with their
resolutions. Three new findings were recorded. The most material is that the
encryption work done for CD-004 left one bearer credential out of scope:
`ApiKey.KeyValue` — the inbound-API authentication secret — is still persisted in
plaintext (`ConfigurationDatabase-012`); it cannot use the same non-deterministic
Data Protection converter because authentication looks the key up *by value*, so it
needs a hash-based scheme instead. The second is a resilience gap in the encryption
plumbing itself: `ApplySecretColumnEncryption` silently substitutes a throwaway
`EphemeralDataProtectionProvider` whenever no provider is supplied, so any context
constructed via the single-argument constructor on a write path would encrypt
secrets with a key discarded at process exit, yielding permanently undecryptable
ciphertext with no error (`ConfigurationDatabase-013`). The third is a minor
inconsistency — a redundant cast on one of the three `HasConversion` calls
(`ConfigurationDatabase-014`). The module is otherwise healthy and the prior fixes
hold up well.
## Checklist coverage
| # | Category | Examined | Notes |
@@ -44,11 +66,11 @@ repositories (`TemplateEngineRepository`, `DeploymentManagerRepository`,
| 1 | Correctness & logic bugs | ✓ | `GetTemplateWithChildrenAsync` discards loaded children (CD-001); `GetApprovedKeysForMethodAsync` CSV parsing is brittle (CD-008). |
| 2 | Akka.NET conventions | ✓ | No actors in this module; data-access layer only. No issues found. |
| 3 | Concurrency & thread safety | ✓ | DbContext correctly scoped; optimistic concurrency on `DeploymentRecord` correct. Repositories hold no shared mutable state. No issues found. |
| 4 | Error handling & resilience | ✓ | `WaitForDatabaseReadyAsync` is sound. No-arg DI overload fails late and silently (CD-003); audit JSON serialization failure handling (CD-007). |
| 5 | Security | ✓ | Hardcoded `sa` credential literal (CD-002); SMTP/DB-connection/auth secrets stored unencrypted (CD-004). |
| 6 | Performance & resource management | ✓ | `GetAllTemplatesAsync` / `GetTemplateTreeAsync` eager-load multiple collections without `AsSplitQuery` (CD-009). No N+1 in audited paths. |
| 7 | Design-document adherence | ✓ | Audit `Id` type mismatch vs design doc (CD-005); seed data uses `HasData` consistent with design. |
| 8 | Code organization & conventions | ✓ | Mostly clean. `Grpc*` address columns unbounded (CD-006); inconsistent null-guard on injected context (CD-011). |
| 4 | Error handling & resilience | ✓ | `WaitForDatabaseReadyAsync` is sound. No-arg DI overload fails late and silently (CD-003, resolved); audit JSON serialization failure handling (CD-007, resolved). Re-review: ephemeral Data Protection fallback can silently produce undecryptable ciphertext (CD-013). |
| 5 | Security | ✓ | Hardcoded `sa` credential literal (CD-002, resolved); SMTP/DB-connection/auth secrets unencrypted (CD-004, resolved). Re-review: `ApiKey.KeyValue` bearer credential still stored in plaintext (CD-012). |
| 6 | Performance & resource management | ✓ | `GetAllTemplatesAsync` / `GetTemplateTreeAsync` eager-load multiple collections without `AsSplitQuery` (CD-009, resolved). No N+1 in audited paths. Re-review: no new issues. |
| 7 | Design-document adherence | ✓ | Audit `Id` type mismatch vs design doc (CD-005, resolved); seed data uses `HasData` consistent with design. Re-review: no new issues. |
| 8 | Code organization & conventions | ✓ | Mostly clean. `Grpc*` address columns unbounded (CD-006, resolved); inconsistent null-guard on injected context (CD-011, resolved). Re-review: redundant/inconsistent cast on one `HasConversion` call (CD-014). |
| 9 | Testing coverage | ✓ | Several repositories and `InstanceLocator` lack direct tests (CD-010). |
| 10 | Documentation & comments | ✓ | `DeploymentManagerRepository` "WP-24 stub" XML comment is stale; noted in module context but not raised as a standalone finding. No issues found beyond items above. |
@@ -570,3 +592,128 @@ every data-access type behaves uniformly and a hand-constructed instance fails w
informative exception at construction rather than a later `NullReferenceException`.
Regression: `Constructor_NullContext_Throws` tests were added for all four affected types
(`InboundApiRepositoryTests.cs`, `RepositoryCoverageTests.cs`).
### ConfigurationDatabase-012 — Inbound-API `ApiKey.KeyValue` bearer credential stored in plaintext
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Configurations/InboundApiConfiguration.cs:17-19` |
**Description**
`ApiKey.KeyValue` is the bearer credential presented in the `X-API-Key` header to
authenticate Inbound API requests (HighLevelReqs §7.27.3). It is mapped as an
ordinary `nvarchar(500)` column with a unique index and persisted verbatim. Anyone
with read access to the configuration database — or to any `AuditLogEntry.AfterStateJson`
into which an `ApiKey` entity is serialized — obtains live API credentials in cleartext.
`ConfigurationDatabase-004` introduced encryption-at-rest for the other secret-bearing
columns (SMTP credentials, external-system auth config, database connection strings)
but explicitly scoped `ApiKey.KeyValue` out. The omission is genuine: the
`EncryptedStringConverter` built for CD-004 is backed by ASP.NET Data Protection, which
is **non-deterministic** — the same plaintext encrypts to different ciphertext each
time — so it cannot be applied here, because `GetApprovedKeysForMethodAsync` and the
authentication path resolve a key by its value (`GetApiKeyByValueAsync` does
`FirstOrDefaultAsync(k => k.KeyValue == keyValue)`). A non-deterministic converter would
break that equality lookup. The result is that the one credential most exposed to
external callers is the one credential left unprotected.
**Recommendation**
Store a salted cryptographic hash of the key value instead of the plaintext (or
ciphertext): hash on create, and authenticate by hashing the presented key and
comparing. This keeps the equality lookup working (the hash is deterministic) while
ensuring the database never holds a usable credential. The plaintext key would then
only ever be shown once, at creation time, to the Admin who created it. This requires
a coordinated change with the Inbound API / Security components and the `ApiKey`
entity in Commons; record the chosen scheme in
`docs/requirements/Component-ConfigurationDatabase.md` and the Inbound API design doc.
At minimum, ensure `ApiKey` entities are never passed to `IAuditService` without the
key value redacted.
**Resolution**
_Unresolved._
### ConfigurationDatabase-013 — Secret-column encryption silently falls back to an ephemeral (throwaway) key
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/ScadaLinkDbContext.cs:107-124` |
**Description**
`ApplySecretColumnEncryption` resolves the Data Protection provider as
`_dataProtectionProvider ?? new EphemeralDataProtectionProvider()`. The `??` fallback
is reached whenever the context is constructed via the single-argument
`ScadaLinkDbContext(DbContextOptions)` constructor — i.e. whenever no provider was
injected. An `EphemeralDataProtectionProvider` generates a key ring that lives only in
process memory and is discarded at process exit.
For design-time `dotnet ef` tooling this is harmless (the XML remark correctly notes
it only emits schema). The risk is on a *runtime write path*. The runtime currently
gets the provider-bearing context only because `AddConfigurationDatabase` adds an
`AddScoped` factory registration that overrides EF's activator-based registration.
That override is the single thing standing between correct behaviour and silent data
corruption: any future change that resolves a `ScadaLinkDbContext` through a path the
override does not cover — an `AddPooledDbContextFactory`/`IDbContextFactory<ScadaLinkDbContext>`
registration, a second `AddDbContext` call, a hand-constructed context in server code —
would construct the context with the single-arg constructor, encrypt secret columns
with a throwaway key, and persist ciphertext that becomes **permanently undecryptable
the moment the process restarts**. There is no exception, no warning; the failure only
surfaces later as `CryptographicException` on read (mis-attributed by
`EncryptedStringConverter` to "the stored value was not written by this system").
**Recommendation**
Do not silently substitute an ephemeral provider for write-capable contexts. Either:
(a) require the provider unconditionally and have design-time tooling pass an explicit
ephemeral provider so the intent is visible at the call site; or (b) keep the
single-arg constructor but mark contexts built without a real provider as
schema-only — e.g. record a flag and have the encrypting converter throw a clear
`InvalidOperationException` ("secret columns cannot be written without a configured
Data Protection key ring") on the first `Protect`, instead of producing throwaway
ciphertext. Also harden the DI wiring so a `ScadaLinkDbContext` cannot be resolved
through the EF-activator registration at all (e.g. register only the factory, or use
`AddDbContextFactory` with the explicit constructor).
**Resolution**
_Unresolved._
### ConfigurationDatabase-014 — Redundant, inconsistent cast on one `HasConversion` call
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/ScadaLinkDbContext.cs:121-123` |
**Description**
`ApplySecretColumnEncryption` calls `HasConversion(converter)` three times. The first
two (`SmtpConfiguration.Credentials`, `ExternalSystemDefinition.AuthConfiguration`)
pass `converter` directly; the third (`DatabaseConnectionDefinition.ConnectionString`)
casts it to `(Microsoft.EntityFrameworkCore.Storage.ValueConversion.ValueConverter)`.
`EncryptedStringConverter` already derives from `ValueConverter<string?, string?>`
(itself a `ValueConverter`), and the first two call sites compile fine without the
cast, so the cast is redundant. The inconsistency makes the code look as though the
third call needs special handling when it does not, and the fully-qualified type name
inline adds noise.
**Recommendation**
Remove the cast so all three calls read identically as `HasConversion(converter)`.
If a `ValueConverter`-typed reference is genuinely wanted, give it a local variable of
that type once and use it for all three.
**Resolution**
_Unresolved._
+208 -12
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.DataConnectionLayer` |
| Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 4 |
## Summary
@@ -30,20 +30,40 @@ the design doc's failover state machine and the implemented unstable-disconnect
heuristic. Test coverage is adequate for the happy paths and failover but absent for
tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.
#### Re-review 2026-05-17 (commit `39d737e`)
All 13 findings from the 2026-05-16 review remain `Resolved` and the fixes were
verified in place against the current source (`PipeTo(Self)` subscribe pattern,
`Resume` supervision, `ConcurrentDictionary` callback maps, atomic disconnect guards,
bounded write timeout, etc.). The re-review walked all 10 checklist categories again
and found **4 new findings**: one **High** — the DCL-012 security warning is never
seen in production because `RealOpcUaClientFactory.Create()` constructs
`RealOpcUaClient` with no logger, so the warning sinks into `NullLogger`; one
**Medium** — initial-connect failures in the `Connecting` state never count toward
failover, so a connection whose primary endpoint is unreachable at startup retries the
primary forever and never tries the configured backup; one **Medium**
`HandleSubscribeCompleted` always replies `SubscribeTagsResponse(success: true)` even
when a connection-level subscribe failure is driving the actor into `Reconnecting`,
telling the Instance Actor the subscribe succeeded when it did not; and one **Low**
`WriteBatchAsync` does not catch the `InvalidOperationException` from `EnsureConnected`,
so a mid-batch disconnect aborts the whole write batch (the same class of defect
DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
`DataConnectionLayer-014`.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | `_resolvedTags` double-counting and stale counters after failover; `ReadBatchAsync` aborts mid-batch. |
| 2 | Akka.NET conventions | x | `Task.Run` mutating actor state (critical); `Restart` supervision loses state; closures capturing `_subscriptionsByInstance`. |
| 3 | Concurrency & thread safety | x | Actor state mutated off the actor thread; `RealOpcUaClient` callback dictionary unsynchronized. |
| 4 | Error handling & resilience | x | Subscription failures not surfaced; unbounded write with no timeout; reconnect after subscribe-time failure not handled. |
| 5 | Security | x | `AutoAcceptUntrustedCerts` defaults to `true`; OPC UA password handling acceptable. See finding 012. |
| 6 | Performance & resource management | x | `HandleUnsubscribe` O(n^2) over instances; initial-read loop serial per tag. |
| 7 | Design-document adherence | x | Failover heuristic (unstable-disconnect count) differs from documented state machine; `WriteTimeout` documented but unused. |
| 1 | Correctness & logic bugs | x | 2026-05-16 findings resolved. Re-review: finding 016 — `SubscribeTagsResponse` reports success on a connection-level subscribe failure. |
| 2 | Akka.NET conventions | x | 2026-05-16 findings resolved (`PipeTo(Self)` subscribe, `Resume` supervision). Re-review: no new issues. |
| 3 | Concurrency & thread safety | x | 2026-05-16 findings resolved (`ConcurrentDictionary`, atomic disconnect guards). Re-review: no new issues. |
| 4 | Error handling & resilience | x | Re-review: finding 015 — initial-connect failures never trigger failover; finding 017 — `WriteBatchAsync` aborts on mid-batch disconnect. |
| 5 | Security | x | Re-review: finding 014 — the DCL-012 auto-accept-cert warning is never logged in production (`RealOpcUaClient` built without a logger). |
| 6 | Performance & resource management | x | 2026-05-16 finding 008 resolved (reverse index). Re-review: no new issues. |
| 7 | Design-document adherence | x | 2026-05-16 findings 005/009 resolved. Re-review: no new issues (finding 015 logged under resilience). |
| 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. |
| 9 | Testing coverage | x | No tests for tag-resolution retry, disconnect/re-subscribe, bad-quality push, or `HandleSubscribe` concurrency. |
| 10 | Documentation & comments | x | XML comment on `RaiseDisconnected` claims thread safety it does not have; design doc round-robin description stale. |
| 9 | Testing coverage | x | DCL001013 regression tests present. Re-review: gaps remain for finding 014/015/016 scenarios (no test for production logger wiring, startup failover, or subscribe-response-on-failure). |
| 10 | Documentation & comments | x | 2026-05-16 finding 013 resolved. Re-review: no new issues. |
## Findings
@@ -661,3 +681,179 @@ fanning 32 barrier-synchronised threads that raise the client's `ConnectionLost`
simultaneously, and asserts `Disconnected` fires exactly once per round; against a
non-atomic check-then-set it double-fires (verified by temporarily reverting the
guard), and it passes against the atomic fix.
### DataConnectionLayer-014 — DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:325`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:35-39,79-83` |
**Description**
Finding DataConnectionLayer-012 was resolved in part by adding a prominent
`ILogger` warning in `RealOpcUaClient.ConnectAsync` whenever the auto-accept
certificate validator is installed (`RealOpcUaClient.cs:79-83`). The
`ILogger<RealOpcUaClient>` constructor parameter was made optional, defaulting to
`NullLogger<RealOpcUaClient>.Instance` (`RealOpcUaClient.cs:35-39`).
However, the only production code path that constructs a `RealOpcUaClient` is
`RealOpcUaClientFactory.Create()` (`RealOpcUaClient.cs:325`), which calls
`new RealOpcUaClient(_globalOptions)` and passes **no logger**. The factory itself
holds only an `OpcUaGlobalOptions` and has no `ILoggerFactory`/`ILogger` available.
As a result the `_logger` field is always `NullLogger` for every real OPC UA
connection, and the man-in-the-middle warning the DCL-012 fix added is silently
discarded. An operator who deploys a connection with `AutoAcceptUntrustedCerts`
enabled — accepting any server certificate on an industrial control link — gets no
visible signal anywhere in the logs. The in-scope half of DCL-012's resolution is
therefore not actually effective in production; only the unit test
(`DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse`, which only
checks the default value) passes.
**Recommendation**
Thread a real logger through to `RealOpcUaClient`. `DataConnectionFactory` already
holds an `ILoggerFactory` and constructs `RealOpcUaClientFactory(globalOptions)`
give `RealOpcUaClientFactory` an `ILoggerFactory` (or an `ILogger<RealOpcUaClient>`)
constructor parameter and pass `_loggerFactory.CreateLogger<RealOpcUaClient>()` into
each `new RealOpcUaClient(...)`. Add a test that asserts the warning is emitted on a
real connect with auto-accept enabled (e.g. via a captured `ILogger`), not just that
the default is `false`.
**Resolution**
_Unresolved._
### DataConnectionLayer-015 — Initial-connect failures never trigger failover; an unreachable primary at startup never tries the backup
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:404-417`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:419-493` |
**Description**
Failover between the primary and backup endpoints is implemented in two places, both
reachable only after a connection has already been `Connected` at least once:
`HandleReconnectResult` (Reconnecting state) counts `_consecutiveFailures` and switches
endpoint, and `BecomeReconnecting` counts `_consecutiveUnstableDisconnects`.
`HandleConnectResult` — the handler for the *initial* connection attempt in the
`Connecting` state (`DataConnectionActor.cs:404-417`) — does neither. On failure it
only logs and re-arms the reconnect timer with `AttemptConnect`; it never increments
`_consecutiveFailures`, never consults `_backupConfig`, and never switches endpoint.
Consequence: if the primary endpoint is unreachable when the connection actor first
starts — which is the common case after a fresh artifact deployment, a site restart,
or a primary that is simply down at that moment — the actor retries the *primary*
endpoint indefinitely at `ReconnectInterval` and **never** attempts the configured
backup. The design doc's endpoint-redundancy promise ("automatic failover when the
active endpoint becomes unreachable") is silently not honoured for the
never-connected-yet case, and an operator sees a connection stuck `Connecting` forever
despite a healthy backup being configured.
**Recommendation**
Make `HandleConnectResult` participate in the failover counter the same way
`HandleReconnectResult` does: increment `_consecutiveFailures` on failure and, when
`_backupConfig != null && _consecutiveFailures >= _failoverRetryCount`, perform the
endpoint switch (dispose adapter, create the other adapter, bump `_adapterGeneration`,
log the failover event) before re-arming the timer. Alternatively, fold the initial
connect into the same reconnect path so there is a single failover decision point. Add
a regression test for "primary down at startup, backup configured → fails over to
backup".
**Resolution**
_Unresolved._
### DataConnectionLayer-016 — `HandleSubscribeCompleted` reports `SubscribeTagsResponse` success even on a connection-level subscribe failure
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:606,666-672`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:232-240` |
**Description**
`HandleSubscribeCompleted` computes `connectionLevelFailure` (line 606) and returns it
so the `Connected`-state handler can drive the actor into `Reconnecting`
(`DataConnectionActor.cs:232-240`). But before returning, it unconditionally replies
to the caller with `new SubscribeTagsResponse(..., true, null, ...)` (lines 666-667) —
`Success: true`, `Error: null` — regardless of whether any tag failed at connection
level.
So when a subscribe arrives while the adapter is silently down, the Instance Actor is
told the subscribe **succeeded**, while the connection actor simultaneously transitions
to `Reconnecting`. The tags were never actually subscribed at the adapter (the catch
block recorded `Success: false`); they are recovered later by `ReSubscribeAll` only if
and when reconnection succeeds. The caller has no way to distinguish "subscribed and
healthy" from "accepted, but the connection is currently down" — a misleading
success signal on a request that did not do what the response claims.
(Genuine tag-resolution failures are arguably also reported as overall `true`, but
that is defensible: those tags are tracked in `_unresolvedTags` and the design models
unresolved tags as a runtime quality concern, with a `Bad`-quality `TagValueUpdate`
already pushed. The connection-level case is the clear defect because the actor itself
treats it as a failure worth a state transition.)
**Recommendation**
When `connectionLevelFailure` is true, reply with
`SubscribeTagsResponse(..., success: false, error: "connection unavailable — will
re-subscribe on reconnect", ...)` (or an equivalent), so the caller's response matches
the actor's own assessment. Optionally carry per-tag outcomes in the response so the
Instance Actor can reflect partial success. Add a test asserting the response is not
`Success: true` when a connection-level subscribe failure drives `Reconnecting`.
**Resolution**
_Unresolved._
### DataConnectionLayer-017 — `WriteBatchAsync` aborts the whole batch on a mid-batch disconnect
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:229-237`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:218-227` |
**Description**
`WriteBatchAsync` loops calling `WriteAsync` per tag (`OpcUaDataConnection.cs:229-237`).
`WriteAsync` returns a `WriteResult` for OPC-UA-level write rejections (good — a bad
status does not abort the batch), but it first calls `EnsureConnected()`
(`OpcUaDataConnection.cs:220`), which throws `InvalidOperationException` when the
client is disconnected. `WriteBatchAsync` does not catch that exception, so if the
connection drops partway through a batch the whole `WriteBatchAsync` throws and the
caller gets no result map — losing the per-tag outcomes for the tags that already
wrote. This is the same class of defect that DataConnectionLayer-007 fixed for
`ReadBatchAsync` (which now records a failed `ReadResult` per failing tag and only
propagates `OperationCanceledException`). `WriteBatchAsync` feeds
`WriteBatchAndWaitAsync` (line 246), so a disconnect during a flag-and-wait write
sequence surfaces as an unhandled exception rather than a clean `false`/per-tag result.
Severity is Low because device writes are real-time control operations with no
store-and-forward, the batch write paths are not on the primary `HandleWrite` hot path
(`HandleWrite` calls single-tag `WriteAsync`), and a disconnect mid-batch is itself an
error condition — but the inconsistent error shape (exception vs. per-tag result) is a
maintainability and correctness wart.
**Recommendation**
Mirror the DCL-007 fix: wrap the per-tag `WriteAsync` call in `WriteBatchAsync` in a
try/catch that records a failed `WriteResult(false, ex.Message)` for the failing tag
and continues, while still propagating `OperationCanceledException` to abort a
cancelled batch as a whole. This gives callers (including `WriteBatchAndWaitAsync`) a
complete, consistent result map.
**Resolution**
_Unresolved._
+158 -12
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.DeploymentManager` |
| Design doc | `docs/requirements/Component-DeploymentManager.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 3 |
## Summary
@@ -30,20 +30,43 @@ detail. Configuration is not bound to `appsettings.json`, leaving one option
entirely dead. Test coverage stops at the communication boundary and never
exercises a successful deployment or the lifecycle success paths.
#### Re-review 2026-05-17 (commit `39d737e`)
Re-reviewed at commit `39d737e` after the batch of fixes for
DeploymentManager-001..014. All fourteen prior findings remain `Resolved` and
verified against source — the broadened catch, non-cancellable cleanup writes,
ref-counted `OperationLockManager`, query-before-redeploy reconciliation,
structured diff, options binding, and the expanded TestKit-actor test suite are
all present and correct. The module is in markedly better shape than the
first review: error paths are now defensively handled and test coverage is
broad (successful deploy/lifecycle, lock serialization, reconciliation
matrix, artifact per-site matrix).
This re-review found **3 new findings**, all clustered on the
DeploymentManager-006 reconciliation path added since the last review. The
reconciliation shortcut (`TryReconcileWithSiteAsync`) marks a stale prior
record `Success` when the site already has the target revision, but it does
**not** perform the side effects the normal success path does — it never
updates the instance `State`, never refreshes the `DeployedConfigSnapshot`,
and never corrects the prior record's own `RevisionHash` (DeploymentManager-015,
DeploymentManager-016). The `GetDeploymentStatusAsync` XML doc is now stale —
it still describes the query-before-redeploy behaviour that actually moved into
`TryReconcileWithSiteAsync` (DeploymentManager-017).
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Stuck `InProgress` record on unexpected exception; cancelled-token failure write. |
| 1 | Correctness & logic bugs | ✓ | Re-review 2026-05-17: reconciliation skips instance-state/snapshot updates (DeploymentManager-015) and keeps a stale `RevisionHash` (DeploymentManager-016). Prior: stuck `InProgress` / cancelled-token write (resolved). |
| 2 | Akka.NET conventions | ✓ | Module is a plain service layer; it calls `CommunicationService` which wraps Ask. No actors here. No issues. |
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` is sound but leaks semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. |
| 4 | Error handling & resilience | ✓ | Several gaps — see DeploymentManager-001/002/003/004. |
| 5 | Security | ✓ | SMTP credentials are serialized and broadcast to sites — see DeploymentManager-013. No injection vectors; no authz here (enforced upstream). |
| 6 | Performance & resource management | ✓ | Semaphore leak (DeploymentManager-005); artifact rebuild does N+1 method queries per external system. |
| 7 | Design-document adherence | ✓ | Missing query-before-redeploy (DeploymentManager-006); Diff View not implemented (DeploymentManager-007). |
| 8 | Code organization & conventions | ✓ | Options class not bound to configuration — DeploymentManager-008. POCO/repo placement correct. |
| 9 | Testing coverage | ✓ | No successful-deploy test, no lifecycle success test — DeploymentManager-011; dead `CreateCommand` helper — DeploymentManager-014. |
| 10 | Documentation & comments | ✓ | Misleading timeout comment — DeploymentManager-009; stale option XML doc — DeploymentManager-012. |
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counts and reclaims semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. No issues at re-review. |
| 4 | Error handling & resilience | ✓ | Prior gaps DeploymentManager-001/002/003/004 resolved and verified. No new issues. |
| 5 | Security | ✓ | SMTP credential handling documented as an accepted design decision (DeploymentManager-013). No injection vectors; no authz here (enforced upstream). No new issues. |
| 6 | Performance & resource management | ✓ | Semaphore leak resolved (DeploymentManager-005). No new issues. |
| 7 | Design-document adherence | ✓ | Query-before-redeploy and Diff View implemented (DeploymentManager-006/007). Re-review: reconciliation path breaks the deployed-snapshot/instance-state invariants — see DeploymentManager-015. |
| 8 | Code organization & conventions | ✓ | Options binding resolved (DeploymentManager-008). POCO/repo placement correct. No new issues. |
| 9 | Testing coverage | ✓ | Broad coverage added (success, lifecycle, lock serialization, reconciliation, artifact matrix). Re-review: reconciled-success path's missing side effects (DeploymentManager-015) are untested. |
| 10 | Documentation & comments | ✓ | Prior comment findings resolved. Re-review: `GetDeploymentStatusAsync` XML doc is now stale — DeploymentManager-017. |
## Findings
@@ -710,3 +733,126 @@ the communication boundary. New tests:
`DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId` (also
covers DeploymentManager-010), `DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix`
(per-site success/failure matrix), `RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits`.
### DeploymentManager-015 — Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:631-655` |
**Description**
`TryReconcileWithSiteAsync` (the DeploymentManager-006 query-before-redeploy
path) handles the case where a prior `InProgress`/timeout-`Failed` record exists
and the site reports it already has the target revision hash. In that case it
marks the prior `DeploymentRecord` `Success`, audit-logs `DeployReconciled`, and
returns it — the caller then returns `Result.Success` and **never enters the
normal deploy body**.
The normal success path (`DeployInstanceAsync.cs:215-223`) does three things on
a successful site response: writes the deployment record terminal status, sets
`instance.State = InstanceState.Enabled` + `UpdateInstanceAsync`, and calls
`StoreDeployedSnapshotAsync`. The reconciliation shortcut performs only the
first. Consequently, after a reconciled deployment:
- The instance `State` is left at whatever it was (e.g. `NotDeployed` for a
first-time deploy that timed out, or `Disabled`) even though the site is
actually running the configuration — the central state machine and the site
diverge, and a subsequent `DisableInstanceAsync`/`EnableInstanceAsync` will be
rejected or allowed incorrectly by `StateTransitionValidator`.
- No `DeployedConfigSnapshot` is created or refreshed. A first-time deploy that
is resolved purely by reconciliation leaves `GetDeploymentComparisonAsync`
permanently returning `"No deployed snapshot found for this instance."`, and a
redeploy reconciliation leaves the stored snapshot showing the *old* config
even though the deployment record claims `Success` for the new revision.
The design ("Deployed vs. Template-Derived State", WP-4/WP-8) requires the
deployed snapshot and instance state to reflect the last successful deployment;
the reconciliation path silently breaks both invariants.
**Recommendation**
In the reconciled-success branch of `TryReconcileWithSiteAsync`, perform the
same post-success side effects as the normal path: set `instance.State =
InstanceState.Enabled` (+ `UpdateInstanceAsync`) and call
`StoreDeployedSnapshotAsync` with the target deployment ID / revision hash /
config JSON. Factor the shared post-success logic into one helper so the normal
and reconciliation paths cannot drift. Add a regression test asserting that a
reconciled deployment leaves the instance `Enabled` and a snapshot stored.
**Resolution**
_Unresolved._
### DeploymentManager-016 — Reconciled prior record keeps its stale `RevisionHash`
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:639-651` |
**Description**
When `TryReconcileWithSiteAsync` reconciles a prior record, it mutates
`prior.Status`, `prior.ErrorMessage`, and `prior.CompletedAt`, but **not**
`prior.RevisionHash`. The reconciliation condition only compares the *site's*
`AppliedRevisionHash` against the *freshly-flattened* `targetRevisionHash` — it
does not require `prior.RevisionHash` to equal either of them.
The prior record can legitimately carry a different revision hash than the
current target: e.g. a deploy timed out at revision `R1`, the template was then
edited so the current flatten yields `R2`, and meanwhile the site actually
applied `R2` through some other path (or `R1` and `R2` are equal-by-content but
the prior record predates a hash recompute). After reconciliation the record's
`Status` is `Success` but its `RevisionHash` still says `R1`, so staleness
checks and any UI that reads `DeploymentRecord.RevisionHash` will report the
instance as deployed at the wrong revision. The audit `DeployReconciled` entry
records `RevisionHash = targetRevisionHash`, contradicting the persisted record.
**Recommendation**
In the reconciled-success branch, also set `prior.RevisionHash =
targetRevisionHash` so the persisted record, the audit entry, and the site's
actual applied revision all agree. Alternatively, only reconcile when
`prior.RevisionHash == targetRevisionHash` and otherwise fall through to a
normal deploy.
**Resolution**
_Unresolved._
### DeploymentManager-017 — `GetDeploymentStatusAsync` XML doc describes behaviour it does not implement
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:562-570` |
**Description**
The XML summary on `GetDeploymentStatusAsync` reads: *"WP-2: After
failover/timeout, query site for current deployment state before
re-deploying."* The method body does no such thing — it is a one-line
pass-through to `_repository.GetDeploymentByDeploymentIdAsync`, a pure local DB
read. The query-the-site-before-redeploy behaviour the comment describes was
implemented separately in `TryReconcileWithSiteAsync` (DeploymentManager-006).
The stale comment is a leftover of the original design intent and misleads a
reader into thinking this method contacts the site.
**Recommendation**
Reword the summary to describe what the method actually does — "returns the
current persisted `DeploymentRecord` for the given deployment ID from the
configuration database" — and, if useful, cross-reference
`TryReconcileWithSiteAsync` as the place the site-query reconciliation lives.
**Resolution**
_Unresolved._
+175 -13
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.ExternalSystemGateway` |
| Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 3 |
## Summary
@@ -30,20 +30,41 @@ Test coverage is thin — `CachedCall` transient/buffering paths and `DatabaseGa
are entirely untested. Themes: incomplete wiring against the S&F engine, and design-doc
requirements (timeout, retry settings) that are declared but not implemented.
#### Re-review 2026-05-17 (commit `39d737e`)
All fourteen prior findings remain `Resolved`; the resolutions for findings 001014
were spot-checked against the current source and hold. The re-review walked the full
10-category checklist again and surfaced **three new findings**. The most serious
(`ExternalSystemGateway-015`, High) is a regression *introduced by* the
`ExternalSystemGateway-004` resolution: `CachedCall`/`CachedWrite` now pass a
per-system/per-connection `MaxRetries` of `0` through verbatim, but
`StoreAndForwardService.RetryMessageAsync` interprets a stored `MaxRetries` of `0` as
**retry forever**, not "never retry" — so the very `0` the ESG-004 fix claims to
"honour as never retry" actually produces an unbounded retry loop, and two ESG tests
assert the broken behaviour. `ExternalSystemGateway-016` (Medium) is that the
`ExternalSystemGateway-013` resolution used `ConfigureHttpClientDefaults`, which is a
**process-global** registration — it forces a `SocketsHttpHandler` (capped at the ESG
option) onto every `HttpClient` in the host, including the Notification Service's
OAuth2 token client, not just the gateway's per-system clients. `ExternalSystemGateway-017`
(Low) is a trailing-`?` URL nit when a GET method's parameters are all null. Theme:
both substantive findings are second-order defects in earlier fixes — the earlier
resolutions did not verify the downstream contract of the S&F engine they integrate
with.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | URL building edge cases, dropped S&F result, classification gaps — findings 003, 006, 009. |
| 2 | Akka.NET conventions | ☑ | No actors in this module; `AddExternalSystemGatewayActors` is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found in this module. |
| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; `ExternalCallResult.Response` lazy-parse is not thread-safe but instances are single-use. No findings raised. |
| 4 | Error handling & resilience | ☑ | S&F handler never registered, double-dispatch, timeout not applied, cancellation conflation — findings 001, 002, 003, 008. |
| 5 | Security | ☑ | Auth secrets logged-safe, but error bodies echoed verbatim — finding 007. |
| 6 | Performance & resource management | ☑ | `HttpRequestMessage`/`HttpResponseMessage` and failed `SqlConnection` not disposed; full repository scan per call — findings 005, 010, 011. |
| 7 | Design-document adherence | ☑ | Timeout, retry settings, audit logging gaps — findings 002, 004, 012. |
| 8 | Code organization & conventions | ☑ | Options class correctly owned by module; `MaxConcurrentConnectionsPerSystem` unused — finding 013. |
| 9 | Testing coverage | ☑ | CachedCall buffering and DatabaseGateway untested — finding 014. |
| 10 | Documentation & comments | ☑ | XML docs reference WP numbers; permanent-failure logging requirement unverified — folded into finding 012. |
| 1 | Correctness & logic bugs | ☑ | Prior: URL edge cases, dropped S&F result, classification — 003, 006, 009. Re-review: `MaxRetries == 0` retry-forever vs never-retry contradiction (015); trailing-`?` URL nit (017). |
| 2 | Akka.NET conventions | ☑ | No actors in this module; `AddExternalSystemGatewayActors` is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found. |
| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; the S&F delivery handlers resolve in a fresh DI scope on the sweep thread. No findings raised. |
| 4 | Error handling & resilience | ☑ | Prior: handler registration, double-dispatch, timeout, cancellation — 001, 002, 003, 008. Re-review: the unbounded-retry consequence of finding 015 is also a resilience defect (recorded under category 1). |
| 5 | Security | ☑ | Error bodies now truncated (007). No new issues — auth secrets not logged, body capped. |
| 6 | Performance & resource management | ☑ | Prior: disposal and repository-scan findings 005, 010, 011 — all resolved and verified. No new issues. |
| 7 | Design-document adherence | ☑ | Prior: timeout, retry settings, logging — 002, 004, 012. Re-review: finding 015 is also a design-adherence gap (S&F retry contract); recorded under category 1. |
| 8 | Code organization & conventions | ☑ | Prior: `MaxConcurrentConnectionsPerSystem` wiring — 013. Re-review: that wiring uses process-global `ConfigureHttpClientDefaults`, leaking the ESG cap onto every host `HttpClient` — finding 016. |
| 9 | Testing coverage | ☑ | Coverage is broad after finding 014. Re-review note: the `ZeroMaxRetries...` tests assert the persisted column, not the sweep outcome, and so lock in the finding-015 defect. |
| 10 | Documentation & comments | ☑ | Inline comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101` assert a "never retry" semantic that the code does not deliver — see finding 015. |
## Findings
@@ -760,3 +781,144 @@ headers and body so URL/auth/body construction is now verified, not just status
These are new-coverage tests against already-correct behaviour, so they pass on the
current source; the `BuildUrl` and `ApplyAuth` paths they exercise are now protected
against regression.
### ExternalSystemGateway-015 — `MaxRetries == 0` is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:120-127`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:102-108` |
**Description**
The `ExternalSystemGateway-004` fix removed the `MaxRetries > 0 ? ... : null` guard so
that `CachedCallAsync` and `CachedWriteAsync` now pass the definition's `MaxRetries`
to `StoreAndForwardService.EnqueueAsync` verbatim. The stated rationale (and the inline
comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101`, plus the
tests `CachedCall_TransientFailure_ZeroMaxRetriesIsHonouredNotTreatedAsUnset` and
`CachedWrite_ZeroMaxRetriesIsHonouredNotTreatedAsUnset`) is that a configured
`MaxRetries` of `0` means **"never retry"**.
That is the opposite of what the Store-and-Forward engine actually does with the
value. `EnqueueAsync` stores the passed `maxRetries` directly into
`StoreAndForwardMessage.MaxRetries` (`StoreAndForwardService.cs:139`), whose own XML
doc states **"Maximum retry-sweep attempts before parking (0 = no limit)"**
(`StoreAndForwardMessage.cs:30`). The retry sweep enforces it as
`if (message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries)`
(`StoreAndForwardService.cs:285`) — when `MaxRetries == 0` that guard is false on
every sweep, so the message is **never parked and is retried forever**.
Consequences for a `CachedCall`/`CachedWrite` against a system or connection
configured with `MaxRetries = 0`:
1. A transiently-failing message that the operator intended never to retry is instead
retried on every sweep indefinitely, accumulating in the buffer and repeatedly
re-dispatching the request — the exact unbounded-retry / duplicate-delivery hazard
the idempotency note in the design doc warns about.
2. The two ESG regression tests cited above assert `max_retries == 0` is *stored* and
describe that as "honoured" — they verify the persisted column, never the resulting
sweep behaviour, so they lock in the broken semantics.
3. Because the SiteRuntime repository still always supplies `MaxRetries == 0` (the
open companion gap noted in `ExternalSystemGateway-004`), **every** cached call and
cached write at every site currently buffers as retry-forever. Before the ESG-004
fix the `> 0` guard sent `null`, so the S&F `DefaultMaxRetries` (a bounded value)
applied — i.e. the ESG-004 fix turned a bounded retry into an unbounded one for the
common path.
**Recommendation**
Reconcile the ESG and S&F interpretations of `MaxRetries == 0` — they must agree.
Either: (a) treat the entity's `MaxRetries == 0` as "unset" and pass `null` so the
bounded S&F default applies (reverting to the pre-ESG-004 behaviour, and accepting
that "never retry" then needs a different representation such as a nullable field or a
`MaxRetries == 1` convention); or (b) if `0` genuinely must mean "never retry", add an
explicit no-retry path — e.g. do not enqueue at all on transient failure when
`MaxRetries == 0`, or introduce a distinct sentinel — and fix the
`StoreAndForwardMessage.MaxRetries` doc and `RetryMessageAsync` guard so `0` no longer
means "no limit". Update the two `ZeroMaxRetries...` tests to assert the *sweep*
outcome (parked / not retried), not just the stored column value.
**Resolution**
_Unresolved._
### ExternalSystemGateway-016 — `ConfigureHttpClientDefaults` applies the ESG connection cap to every `HttpClient` in the host process
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:21-29` |
**Description**
The `ExternalSystemGateway-013` fix wires `MaxConcurrentConnectionsPerSystem` by
calling `services.ConfigureHttpClientDefaults(builder => builder.ConfigurePrimaryHttpMessageHandler(...))`.
The inline comment claims this "applies to the dynamically-named clients created by
`ExternalSystemClient`" — but `ConfigureHttpClientDefaults` is **process-global**: it
adds the configuration to *every* `HttpClient`/`IHttpClientFactory` client created
anywhere in the host, regardless of name.
The Host registers the External System Gateway alongside other components that also
use `IHttpClientFactory` — notably `ScadaLink.NotificationService` (`OAuth2TokenService`
and its `ServiceCollectionExtensions` call `AddHttpClient`). With the ESG registration
present, the OAuth2 token client (and any future `HttpClient` consumer in the host)
has its **primary handler replaced** by a `SocketsHttpHandler` whose
`MaxConnectionsPerServer` is the ESG's `MaxConcurrentConnectionsPerSystem`. That:
1. Silently caps unrelated clients at a value owned by, and named for, a different
component — an operator tuning the ESG option unknowingly throttles Microsoft 365
OAuth2 token acquisition.
2. Overrides/discards any primary-handler configuration those other components add for
their own clients (e.g. a custom handler, proxy, or certificate settings).
This is a leaky, surprising side effect for what the option claims to be a per-ESG
setting; `ConfigureHttpClientDefaults` should not be used to express a single
component's policy.
**Recommendation**
Scope the handler configuration to the gateway's own clients only. The ESG already
creates per-system clients with a deterministic name pattern
(`ExternalSystem_{system.Name}`); register a typed/named `HttpClient` (or a small
factory abstraction) for that pattern and call `ConfigurePrimaryHttpMessageHandler`
on that registration instead of on the global defaults. If a name-pattern registration
is impractical, document the global effect explicitly and rename the option, but the
preferred fix is to stop using `ConfigureHttpClientDefaults`.
**Resolution**
_Unresolved._
### ExternalSystemGateway-017 — `BuildUrl` appends a bare trailing `?` when a GET method's parameters are all null
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:324-333` |
**Description**
In `BuildUrl`, the GET/DELETE query-string branch is entered when
`parameters != null && parameters.Count > 0`, but the projection then filters out
null-valued entries (`parameters.Where(p => p.Value != null)`). When a GET/DELETE
method is invoked with a non-empty parameter dictionary whose values are *all* null,
`queryString` is the empty string and the code still executes `url += "?" + queryString`,
producing a URL ending in a bare `?` (e.g. `https://host/api/resource?`). Most servers
tolerate a trailing `?`, but it is an unintended artifact, can defeat response caching
on some endpoints, and makes captured request URLs harder to read in logs.
**Recommendation**
Only append `"?" + queryString` when `queryString` is non-empty (compute the joined
string first and check it), so a method whose effective parameter set is empty
produces a clean URL identical to the no-parameters case.
**Resolution**
_Unresolved._
+170 -7
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.HealthMonitoring` |
| Design doc | `docs/requirements/Component-HealthMonitoring.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 4 |
## Summary
@@ -32,20 +32,38 @@ heartbeat path, and most collector setters. None of the findings are crash-class
but the concurrency issues are Medium/High and the missing S&F metric is a real
design-adherence gap.
#### Re-review 2026-05-17 (commit `39d737e`)
All twelve prior findings (HealthMonitoring-001..012) are confirmed `Resolved`
`SiteHealthState` is now an immutable `sealed record` mutated only via atomic
compare-and-swap, the store-and-forward buffer-depth metric is populated, the
central-site offline grace and the unknown-site heartbeat registration are in
place, and the test suite has grown to cover the report loop, heartbeat path, and
collector setters. This re-review found **4 new findings, all Low/Medium, none
crash-class**. They are residual polish items rather than behaviour regressions:
an inaccurate offline-check-interval comment (HealthMonitoring-013), unvalidated
`HealthMonitoringOptions` intervals that crash the hosted service on
misconfiguration (HealthMonitoring-014), a heartbeat-only registered site left
with a year-0001 `LastReportReceivedAt` that the UI's staleness display must
special-case (HealthMonitoring-015), and `CollectReport` reading
`DateTimeOffset.UtcNow` directly instead of the module's now-standard injected
`TimeProvider` (HealthMonitoring-016). The module remains small, readable, and
broadly faithful to the design intent.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). |
| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). Re-review: heartbeat-registered site left with year-0001 `LastReportReceivedAt` (HealthMonitoring-015). |
| 2 | Akka.NET conventions | x | Module itself contains no actors (transport abstracted via `IHealthReportTransport`); `AddHealthMonitoringActors` is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope. |
| 3 | Concurrency & thread safety | x | Unguarded mutable `SiteHealthState` (HealthMonitoring-002); mutation inside `AddOrUpdate` delegate (HealthMonitoring-003); `GetAllSiteStates` leaks live mutable references (HealthMonitoring-008). Collector counters correctly use `Interlocked`. |
| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010); top-level loop error handling is sound. |
| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010, resolved); top-level loop error handling is sound. Re-review: `HealthMonitoringOptions` intervals unvalidated — a zero/negative value crashes the hosted service at `PeriodicTimer` construction (HealthMonitoring-014). |
| 5 | Security | x | No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface. |
| 6 | Performance & resource management | x | `PeriodicTimer` instances correctly disposed via `using`. Dictionary snapshots per report are acceptable at the documented scale. No issues found. |
| 7 | Design-document adherence | x | Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006). |
| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011). |
| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011, resolved). Re-review: `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of the module's now-standard injected `TimeProvider` (HealthMonitoring-016). |
| 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012). |
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses `OfflineTimeout` (HealthMonitoring-013). |
## Findings
@@ -560,3 +578,148 @@ has not yet sent a report"). A codebase-wide search confirms no `null!` suppress
remains anywhere in `src/ScadaLink.HealthMonitoring`. This is exactly the change
HealthMonitoring-002 made when converting `SiteHealthState` to an immutable record, so
the contract is now honest and no further code change was required.
### HealthMonitoring-013 — Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout`
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:194-196` |
**Description**
`ExecuteAsync` derives the `PeriodicTimer` cadence with the comment "Check at half
the (shorter) offline timeout interval for timely detection", but the code only
reads `_options.OfflineTimeout`:
```csharp
var checkInterval = TimeSpan.FromMilliseconds(_options.OfflineTimeout.TotalMilliseconds / 2);
```
`CentralOfflineTimeout` (HealthMonitoring-005's fix) is never considered. With the
default options (`OfflineTimeout` 60s, `CentralOfflineTimeout` 3m) `OfflineTimeout`
genuinely is the shorter of the two, so the parenthetical happens to hold. But the
comment states an invariant the code does not enforce: if an operator configures
`CentralOfflineTimeout` *smaller* than `OfflineTimeout`, the check cadence stays
tied to `OfflineTimeout`, and central offline detection is delayed by up to a full
`OfflineTimeout / 2` beyond the intended `CentralOfflineTimeout` window. The comment
misleads a reader into believing the cadence already adapts to whichever timeout is
shorter.
**Recommendation**
Either compute `checkInterval` from `Math.Min(OfflineTimeout, CentralOfflineTimeout)`
so the code matches the comment, or drop the "(shorter)" wording and state plainly
that the cadence is derived from `OfflineTimeout` only (acceptable while the default
`CentralOfflineTimeout` is the larger value).
**Resolution**
_Unresolved._
### HealthMonitoring-014 — `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/HealthMonitoringOptions.cs:3-20`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:196`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:67`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:63` |
**Description**
`HealthMonitoringOptions` is bound from the `ScadaLink:HealthMonitoring` config
section (`SiteServiceRegistration.BindSharedOptions`) with no validation —
no `IValidateOptions<HealthMonitoringOptions>`, no `ValidateDataAnnotations`, no
`ValidateOnStart`. `ReportInterval`, `OfflineTimeout`, and `CentralOfflineTimeout`
are all fed straight into `new PeriodicTimer(...)` (and `OfflineTimeout` into a
division for the check interval). `PeriodicTimer`'s constructor throws
`ArgumentOutOfRangeException` for a zero or negative period. A misconfigured
`appsettings.json` (e.g. `"ReportInterval": "00:00:00"`, an empty/garbled value
that binds to `TimeSpan.Zero`, or a negative span) therefore crashes the
`HealthReportSender` / `CentralHealthReportLoop` / `CentralHealthAggregator`
hosted service at startup with an opaque exception that does not name the
offending config key, rather than failing fast with a clear validation message.
**Recommendation**
Add an options validator (DataAnnotations `[Range]`-style on the spans, or an
`IValidateOptions<HealthMonitoringOptions>`) that rejects non-positive
`ReportInterval`/`OfflineTimeout`/`CentralOfflineTimeout` and ideally requires
`CentralOfflineTimeout >= OfflineTimeout`, and call `.ValidateOnStart()` so a bad
configuration fails fast with a message naming the section and key.
**Resolution**
_Unresolved._
### HealthMonitoring-015 — Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt`
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:122-130`, `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:27` |
**Description**
When `MarkHeartbeat` registers a previously-unknown site (HealthMonitoring-007's
fix), it sets `LastReportReceivedAt = default` — i.e. `DateTimeOffset.MinValue`
(`0001-01-01`). The XML doc on `SiteHealthState.LastReportReceivedAt` states the
field is "Used by the UI to surface report staleness during failover." A
heartbeat-only site therefore has `LatestReport == null` **and**
`LastReportReceivedAt == DateTimeOffset.MinValue`. Any UI code that computes
"last report N ago" as `now - LastReportReceivedAt` without first checking
`LatestReport != null` will render a nonsensical staleness of roughly two
thousand years for a site that is, in fact, freshly reachable. The two
"no report yet" signals (`LatestReport == null`, `LastReportReceivedAt == default`)
are independent and both must be special-cased; the sentinel value is an easy trap.
**Recommendation**
Make `LastReportReceivedAt` nullable (`DateTimeOffset?`) so "no report received
yet" is an explicit, unmissable state rather than a magic sentinel — consistent
with how `LatestReport` was already made nullable for the same case — and have UI
consumers render staleness only when it has a value. Alternatively, document the
`default` sentinel prominently on the field and audit every UI reader, but the
nullable option is safer and matches the existing `LatestReport` treatment.
**Resolution**
_Unresolved._
### HealthMonitoring-016 — `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider`
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:151` |
**Description**
`CollectReport` stamps each report with `ReportTimestamp: DateTimeOffset.UtcNow`,
read directly from the system clock. Every other time-dependent class in the
module — `CentralHealthAggregator`, `HealthReportSender`, `CentralHealthReportLoop`
— was deliberately refactored (HealthMonitoring-006) to take an injectable
`TimeProvider` so the behaviour is deterministically testable and the clock
dependency is explicit. `SiteHealthCollector` is the lone holdout: the report
timestamp cannot be controlled in a unit test, which is why
`SiteHealthCollectorTests.CollectReport_IncludesUtcTimestamp` can only assert the
timestamp falls in a `before`/`after` wall-clock window rather than equalling a
known instant. This is a minor consistency/testability gap, not a behaviour bug.
**Recommendation**
Add an optional `TimeProvider` constructor parameter to `SiteHealthCollector`
(defaulting to `TimeProvider.System`, mirroring the other classes) and derive
`ReportTimestamp` from `GetUtcNow()`, so the report timestamp is deterministically
testable and the module is consistent.
**Resolution**
_Unresolved._
+180 -13
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.Host` |
| Design doc | `docs/requirements/Component-Host.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 4 |
## Summary
@@ -31,20 +31,37 @@ and unguarded string interpolation when building HOCON. None are crash/data-loss
class, but the readiness bug is High because it breaks load-balancer behaviour with
no safe workaround.
#### Re-review 2026-05-17 (commit `39d737e`)
All eleven findings from the first review (Host-001..011) are confirmed `Resolved`
in the current tree — the `/health/ready` predicate, the externalised secrets, the
seed-node/GrpcPort validation rules, the escaped HOCON builder, the bounded
migration retry and the live `LoggingOptions.MinimumLevel` are all present as
described in their Resolution notes. This re-review walked all ten checklist
categories again over the full module and recorded four new findings, none of them
crash/data-loss class. Following up on a batch-1 ClusterInfrastructure re-review
note, Host-012 (Medium) confirms `AkkaHostedService.BuildHocon` hard-codes
`down-if-alone = on` and never reads the `ClusterOptions.DownIfAlone` property, so
that documented, bound option is dead. The remaining three are Low: `:F0` rounding
of cluster timing values silently degrades any sub-second configuration (Host-013),
Serilog sink setup is hard-coded in `Program.cs` rather than configuration-driven as
REQ-HOST-8 requires (Host-014), and `StartupRetry` retries indiscriminately on every
exception type including permanent schema-validation failures (Host-015).
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `/health/ready` includes the leader-only check (Host-001); site seed-node config points at the gRPC port (Host-004). |
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. HOCON built by raw string interpolation (Host-006); `StartAsync` returns before actors are confirmed running (Host-009). |
| 3 | Concurrency & thread safety | ☑ | Blocking `GetAwaiter().GetResult()` on a hosted-service startup thread (Host-005). `DeadLetterMonitorActor` state is actor-confined — no issues. |
| 4 | Error handling & resilience | ☑ | Top-level try/catch logs fatal and rethrows. No retry around DB migration / readiness preconditions (Host-010). |
| 5 | Security | ☑ | Plaintext DB password, LDAP service-account password and dev JWT key checked into `appsettings.Central.json` (Host-003). |
| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. |
| 7 | Design-document adherence | ☑ | REQ-HOST-6 mandates Akka.Persistence config but none exists and no persistent actors exist — doc is stale (Host-002). REQ-HOST-4 GrpcPort-≠-RemotingPort rule not enforced (Host-007). |
| 8 | Code organization & conventions | ☑ | `MachineDataDb` validated/declared but never consumed (Host-008). `LoggingOptions.MinimumLevel` is dead (Host-011). |
| 9 | Testing coverage | ☑ | Strong suite; no test asserts `/health/ready` excludes `active-node`, which is why Host-001 slipped through (noted in Host-001). |
| 10 | Documentation & comments | ☑ | Comments are accurate. REQ-HOST-6 in the design doc is the main stale-doc item (Host-002). |
| 1 | Correctness & logic bugs | ☑ | Host-001/004 resolved. Re-review: `:F0` rounding of cluster timing values silently distorts sub-second durations (Host-013). |
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. Host-006/009 resolved; no new issues. |
| 3 | Concurrency & thread safety | ☑ | Host-005 resolved (`StartAsync` now genuinely async). `DeadLetterMonitorActor` state is actor-confined — no issues. |
| 4 | Error handling & resilience | ☑ | Host-010 resolved. Re-review: `StartupRetry` retries indiscriminately on permanent failures (e.g. schema-validation mismatch) (Host-015). |
| 5 | Security | ☑ | Host-003 resolved — secrets externalised to env vars. No new issues. |
| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. No new issues. |
| 7 | Design-document adherence | ☑ | Host-002/007 resolved. Re-review: `ClusterOptions.DownIfAlone` bound/documented but never consumed — HOCON hard-codes `on` (Host-012); Serilog sinks hard-coded, not config-driven per REQ-HOST-8 (Host-014). |
| 8 | Code organization & conventions | ☑ | Host-008/011 resolved. No new issues. |
| 9 | Testing coverage | ☑ | Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new `down-if-alone`, sub-second-duration, or non-transient-retry paths (Host-012/013/015). |
| 10 | Documentation & comments | ☑ | REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014). |
## Findings
@@ -561,3 +578,153 @@ Regression tests in new `LoggerConfigurationTests`:
— they assert the configured level actually filters log events; they pass against the
new factory (and would fail against the old inline configuration, which ignored the
key). Full Host suite green (175 passed).
### Host-012 — `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:146-148` |
**Description**
`AkkaHostedService.BuildHocon` emits the split-brain-resolver block with
`keep-oldest { down-if-alone = on }` as a literal constant. `ClusterOptions`
(`src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:74`) exposes a
`DownIfAlone` property — bound from `ScadaLink:Cluster` via the Options pattern,
documented as "the design-doc requirement", default `true` — but a repo-wide search
shows it is referenced **nowhere outside its own declaration**. The Host therefore
ignores the bound value entirely: setting `ScadaLink:Cluster:DownIfAlone` to `false`
in `appsettings.json` has no effect, the resolver still runs with `down-if-alone =
on`. This is the same class of defect as the resolved Host-011
(`LoggingOptions.MinimumLevel` was dead config) — a configuration option that is
declared, bound, and documented but never consumed, which silently misleads any
operator who edits it. A batch-1 re-review of ClusterInfrastructure flagged the same
hard-coding; it is recorded here because `BuildHocon` is Host-owned code (the
ClusterInfrastructure project owns only the configuration contract, per the
`ClusterOptions` class comment).
**Recommendation**
Make `BuildHocon` consume `clusterOptions.DownIfAlone` — emit `down-if-alone =
{(clusterOptions.DownIfAlone ? "on" : "off")}` (the value is a bool, so no escaping
is needed). Add a `HoconBuilderTests` case asserting both `true` and `false` produce
the corresponding `down-if-alone` token. If the flag is genuinely meant to be fixed
at `on` for ScadaLink's two-node clusters, remove the `DownIfAlone` property and its
doc comment instead so code and configuration contract agree — but do not leave it
declared-but-dead.
**Resolution**
_Unresolved._
### Host-013 — `:F0` rounding of cluster timing values silently degrades sub-second configuration
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:135-136,145,151-152` |
**Description**
`BuildHocon` renders every duration into HOCON via `TotalSeconds:F0`
`transportHeartbeatSec:F0`, `transportFailureSec:F0`, `StableAfter.TotalSeconds:F0`,
`HeartbeatInterval.TotalSeconds:F0`, `FailureDetectionThreshold.TotalSeconds:F0`. The
`F0` format specifier rounds to whole seconds, so any sub-second configuration is
silently distorted: a `HeartbeatInterval` of `00:00:00.5` renders as
`heartbeat-interval = 0s` (a degenerate / invalid value that Akka will reject or
treat as zero), and `00:00:02.6` becomes `3s`. The shipped defaults are all whole
seconds so this is latent, but the configuration model accepts arbitrary `TimeSpan`
values and an operator tuning a heartbeat to e.g. `750ms` would get a `1s` value with
no warning — or, worse, `0s`. Rounding configured timing values without surfacing
the change is a correctness hazard for exactly the kind of failure-detection tuning
these options exist for.
**Recommendation**
Render durations without precision loss — emit milliseconds (e.g.
`heartbeat-interval = {ts.TotalMilliseconds:F0}ms`) so sub-second values survive, or
validate in `StartupValidator` that each cluster timing value is a positive whole
number of seconds and fail fast otherwise. Either way, do not silently round.
**Resolution**
_Unresolved._
### Host-014 — Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8)
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Host/Program.cs:43-48`, `src/ScadaLink.Host/appsettings.json:1-7` |
**Description**
REQ-HOST-8 requires the Host to configure Serilog with "Configuration-driven sink
setup (console and file sinks at minimum)". `LoggerConfigurationFactory.Build` calls
`.ReadFrom.Configuration(configuration)`, which reads the standard `Serilog`
configuration section — but neither `appsettings.json` nor either role-specific file
contains a `Serilog` section (only a Microsoft `Logging` section with a `LogLevel`
map, which Serilog's `ReadFrom.Configuration` does not consume). The two sinks that
actually run are appended in `Program.cs` as hard-coded `.WriteTo.Console(...)` and
`.WriteTo.File("logs/scadalink-.log", rollingInterval: Day)` calls. As a result the
console output template, the file path, and the rolling interval cannot be changed
without recompiling — an operator cannot redirect logs, change the file location, or
add a sink via configuration, contrary to REQ-HOST-8. The `ReadFrom.Configuration`
call is effectively a no-op because the section it reads does not exist.
**Recommendation**
Move the console and file sink definitions into a `Serilog` section in
`appsettings.json` (with `WriteTo` entries and the output template / file path /
rolling interval as arguments) so `ReadFrom.Configuration` drives them, and drop the
hard-coded `.WriteTo` calls from `Program.cs`. Alternatively, update REQ-HOST-8 to
state the sinks are intentionally code-defined — but the design doc currently says
"configuration-driven", so code and doc must be reconciled.
**Resolution**
_Unresolved._
### Host-015 — `StartupRetry` retries on every exception type, including permanent failures
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Host/StartupRetry.cs:36-45` |
**Description**
`StartupRetry.ExecuteWithRetryAsync` catches `Exception` with only the guard
`when (attempt < maxAttempts)` — it retries on *any* exception type. Its sole caller
wraps `MigrationHelper.ApplyOrValidateMigrationsAsync` (`Program.cs:124-134`), which
on a Central node in production *validates* the schema version and throws if the
deployed schema does not match the expected migration set. A schema-version mismatch
is a permanent error — retrying it cannot succeed — yet `StartupRetry` will retry it
8 times with capped exponential backoff (2s, 4s, 8s, 16s, 30s, 30s, 30s ≈ 2 minutes)
before finally rethrowing, delaying the fatal-exit-and-alert by minutes for a fault
that is fatal on the first attempt. The retry helper is meant for *transient*
connection faults (the XML doc says exactly that: "the database may be briefly
unreachable"), but it cannot distinguish transient from permanent failures.
**Recommendation**
Restrict retries to transient faults — e.g. accept an `Func<Exception, bool>
isTransient` predicate and, for the migration call site, treat only
connection-class exceptions (`SqlException` with a connection/transport error
number, `TimeoutException`, socket errors) as retryable while letting
schema-validation / `InvalidOperationException` failures propagate immediately. Add
a `StartupRetryTests` case asserting a non-transient exception is rethrown after a
single attempt.
**Resolution**
_Unresolved._
+204 -7
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.InboundAPI` |
| Design doc | `docs/requirements/Component-InboundAPI.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 4 |
## Summary
@@ -30,6 +30,25 @@ well but there is no coverage of the HTTP endpoint, concurrency, or recompilatio
None of the findings are data-loss-class, but the concurrency and trust-model issues
are High severity and should be addressed before production use.
#### Re-review 2026-05-17 (commit `39d737e`)
All 13 findings from the initial review remain `Resolved`; the module source under
`src/ScadaLink.InboundAPI` is unchanged since the last InboundAPI fix commit
(`8dd7412`), which precedes `39d737e`. This re-review re-walked all 10 checklist
categories against the resolved code and surfaced **4 new findings** — none touching
the previously-fixed concurrency/trust-model code, but all in areas the first pass
did not probe deeply: (1) the `ReturnDefinition` column is loaded onto `ApiMethod`
but is never consulted — script return values are serialized verbatim with no shaping
or validation against the declared return structure (InboundAPI-014); (2) the new
`ForbiddenApiChecker` is a purely textual syntax walker and can be bypassed by
reaching forbidden functionality through member access that never spells a forbidden
namespace, e.g. `typeof(x).Assembly.GetType("System.IO.File")` (InboundAPI-015);
(3) routed `Route.To().Call()` invocations are not bound by the method timeout unless
the script explicitly threads `Parameters`-side cancellation, contradicting the design
statement that the timeout covers routed calls (InboundAPI-016); and (4) `RouteHelper`
/ `RouteTarget` — the entire WP-4 cross-site routing surface — has no test coverage
(InboundAPI-017). New findings are one Medium-trio plus one Low; no Critical or High.
## Checklist coverage
| # | Category | Examined | Notes |
@@ -38,11 +57,11 @@ are High severity and should be addressed before production use.
| 2 | Akka.NET conventions | ☑ | Module is ASP.NET-hosted, no actors of its own; routes to actors via `CommunicationService`. No correlation-ID issues — IDs are set in `RouteHelper`. |
| 3 | Concurrency & thread safety | ☑ | Singleton `InboundScriptExecutor` mutates a non-thread-safe `Dictionary` from concurrent request threads — see InboundAPI-001/002. |
| 4 | Error handling & resilience | ☑ | Catch-all conflates client cancellation with timeout (InboundAPI-004); compilation-failure path repeats work on every request (InboundAPI-009). |
| 5 | Security | ☑ | Non-constant-time key comparison, no trust-model enforcement, no body-size limit, missing-method enumeration oracle — see InboundAPI-003/005/006/011. |
| 5 | Security | ☑ | Prior items resolved. Re-review: `ForbiddenApiChecker` is a textual deny-list bypassable via reflection without a forbidden namespace token (InboundAPI-015). |
| 6 | Performance & resource management | ☑ | Up to 3 separate DB round-trips per request in `ApiKeyValidator`; uncapped lazy recompilation. |
| 7 | Design-document adherence | ☑ | `Database.Connection()` script API missing; central-only hosting not enforced; lazy-compile diverges from "compiled at startup". |
| 8 | Code organization & conventions | ☑ | `ParameterDefinition` is an API-shaped POCO declared in the component project rather than Commons; otherwise conventions followed. |
| 9 | Testing coverage | ☑ | Good unit coverage of the two validators; no endpoint, concurrency, recompilation, or timeout-vs-cancel tests. |
| 7 | Design-document adherence | ☑ | Re-review: `ReturnDefinition` loaded but never used (InboundAPI-014); routed-call timeout not enforced (InboundAPI-016). Prior `Database.Connection()`/central-only items resolved. |
| 8 | Code organization & conventions | ☑ | `ParameterDefinition` moved to Commons (InboundAPI-012 resolved); no new issues. |
| 9 | Testing coverage | ☑ | Re-review: `RouteHelper`/`RouteTarget` (WP-4 routing) entirely untested (InboundAPI-017); validators/executor/filter well covered. |
| 10 | Documentation & comments | ☑ | `ApiKeyValidationResult.NotFound` XML/name says "NotFound" but returns HTTP 400 — misleading (InboundAPI-013). |
## Findings
@@ -580,3 +599,181 @@ the new method-not-found status, and removing dead code cannot regress. Doc-owne
follow-up: `Component-InboundAPI.md`'s Error Handling section still does not list a
"method not found" status; it should note that it is reported as 403 (indistinguishable
from "key not approved"), but that doc edit is outside this module's editable scope.
### InboundAPI-014 — `ReturnDefinition` is loaded but never used; script return value is unshaped/unvalidated
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:201-205`, `src/ScadaLink.Commons/Entities/InboundApi/ApiMethod.cs:10` |
**Description**
`Component-InboundAPI.md` ("API Method Definition → Return Value Definition" and the
"Response Format" section) specifies that each method has a declared return structure
— "Field names and data types … Supports returning lists of objects" — and that the
success response body is "the method's return value as JSON, with fields matching the
return value definition". The `ApiMethod` entity carries a `ReturnDefinition` column
to hold exactly this. However, nothing in the module ever reads `ReturnDefinition`:
`ExecuteAsync` takes whatever object the script happens to return and does a blind
`JsonSerializer.Serialize(result)`. There is no validation that the script's return
value matches the declared shape, no coercion to the declared field types, and no
error when a method returns a structure inconsistent with its definition. A method
whose script returns the wrong shape (or `null` where a structure is required) will
silently emit a malformed 200 response, and the documented return-definition contract
is effectively unenforced. This is the response-side mirror of the parameter
validation that `ParameterValidator` does perform, leaving the two halves of the
method contract asymmetric.
**Recommendation**
Either (a) implement return-value validation/shaping: parse `ReturnDefinition` with
the same extended-type machinery used for parameters and validate/coerce the script
result before serializing, returning a 500 (or logging) when the script result does
not match; or (b) if return shaping is deliberately out of scope, remove the "Return
Value Definition" / "fields matching the return value definition" language from
`Component-InboundAPI.md` and document that the response is the script's raw return
value serialized as-is. Code and design doc must be reconciled.
**Resolution**
_Unresolved._
### InboundAPI-015 — `ForbiddenApiChecker` is purely textual and is bypassable via reflection reachable without a forbidden namespace token
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/ForbiddenApiChecker.cs:63-119`, `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:109-126` |
**Description**
`ForbiddenApiChecker` walks the script syntax tree and rejects any `using` directive,
`QualifiedNameSyntax`, or `MemberAccessExpressionSyntax` whose textual dotted name
starts with a forbidden namespace prefix (`System.IO`, `System.Diagnostics`,
`System.Reflection`, `System.Net`, etc.). This is a textual match, not a semantic
one, and the trust model it enforces (per InboundAPI-005) is explicitly meant to keep
*untrusted* Design-role scripts away from host APIs. The check can be bypassed because
forbidden functionality is reachable through member access that never spells a
forbidden namespace:
- `typeof(string).Assembly.GetType("System.IO.File")``typeof(string)` is permitted,
`.Assembly` is a `System.Type` property, `.GetType(string)` is a `System.Reflection.Assembly`
method. The string literal `"System.IO.File"` is a string, not a `QualifiedNameSyntax`
or `MemberAccessExpressionSyntax`, so `IsForbidden` never sees it. The script obtains
a `System.IO.File` `Type` and can `InvokeMember`/`GetMethod(...).Invoke(...)` on it —
all via members of permitted types — with no forbidden namespace ever appearing in
the source. `CompileAndRegister` references `typeof(object).Assembly`
(System.Private.CoreLib) in `ScriptOptions`, so every framework type is loadable at
runtime.
- The executor also references the `Microsoft.CSharp.RuntimeBinder` assembly
(`InboundScriptExecutor.cs:116`), enabling the `dynamic` keyword, which further
widens late-bound member access that the static walker cannot see through.
Because the inbound API script runs on the central node with the host process's
privileges and is authored by the (less-trusted-than-Admin) Design role, a static
textual deny-list gives a false sense of containment.
**Recommendation**
Treat the syntax walker as defence-in-depth, not the boundary. Strengthen it where
cheap (flag `Assembly.GetType`, `Type.GetType`, `Activator.CreateInstance`,
`InvokeMember`, and `dynamic` usage), but for real enforcement run compiled scripts
under a genuine boundary — a restricted `AssemblyLoadContext`/AppDomain-equivalent, a
curated reference set that does not expose reflection-to-arbitrary-type, or an
out-of-process sandbox — consistent with however the Site Runtime ultimately enforces
its instance-script trust model. At minimum, document in `Component-InboundAPI.md`
that the current check is best-effort and does not stop a determined script.
**Resolution**
_Unresolved._
### InboundAPI-016 — Routed `Route.To().Call()` invocations are not bound by the method timeout
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/RouteHelper.cs:59-152`, `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:177`, `:199` |
**Description**
`Component-InboundAPI.md` states the per-method timeout "defines the maximum time the
method is allowed to execute (**including any routed calls to sites**)", and the
Routing Behavior section says a routed call "blocks until the site responds or the
**method-level timeout** is reached". The executor builds a linked
`CancellationTokenSource` (`cts`) combining the request-abort token and a dedicated
timeout CTS, and exposes `cts.Token` to the script as `InboundScriptContext.CancellationToken`.
However, every `RouteTarget` method (`Call`, `GetAttribute(s)`, `SetAttribute(s)`)
takes `CancellationToken cancellationToken = default` and the script must *explicitly*
pass the context token for the routed call to honour the timeout. A natural script —
`Route.To("inst").Call("doWork", parameters)` — invokes the routed call with
`CancellationToken.None`. That request flows into `CommunicationService.RouteToCallAsync`
with no cancellation, so the routed call is not bounded by the method timeout at all.
The only timeout guard left is `handler(context).WaitAsync(cts.Token)` in
`ExecuteAsync`: when the method timeout fires, `WaitAsync` returns a cancellation to
the caller, but the underlying script `Task` — and the in-flight `RouteToCallAsync`
awaiting a remote site — keeps running orphaned with no cancellation, holding the
correlation/communication resources until the site eventually responds or its own
transport timeout (if any) fires. The design's guarantee that the method timeout
covers routed calls is therefore not met, and a slow/hung site can leak background
work past the timeout the caller was told bounds the request.
**Recommendation**
Make routed calls inherit the method deadline without relying on script discipline:
have `RouteHelper`/`RouteTarget` carry the executing method's `CancellationToken`
(injected by `InboundScriptExecutor` when it constructs the context, e.g. a
`RouteHelper` bound to `cts.Token`) and pass it into every `CommunicationService`
call by default, so `Route.To("x").Call("s", p)` is timeout-bounded with no token
argument. Keep the explicit-token overload for callers that want a tighter bound.
Verify `RouteToCallAsync` and the attribute-routing calls actually observe the token
and abandon the in-flight request when it fires.
**Resolution**
_Unresolved._
### InboundAPI-017 — `RouteHelper` / `RouteTarget` has no test coverage
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/RouteHelper.cs:1-165`, `tests/ScadaLink.InboundAPI.Tests/` |
**Description**
`RouteHelper`/`RouteTarget` is the entire WP-4 cross-site routing surface — the
`Route.To().Call()/GetAttribute(s)/SetAttribute(s)` API that inbound API scripts use
to reach instances at any site. It has zero tests: the `ScadaLink.InboundAPI.Tests`
project covers `ApiKeyValidator`, `ParameterValidator`, `InboundScriptExecutor`, and
`InboundApiEndpointFilter`, but no test file exercises `RouteHelper`. Untested
behaviours include site resolution via `IInstanceLocator` (including the
"instance not found / no assigned site" `InvalidOperationException` path at
`RouteHelper.cs:154-164`), the `!response.Success``InvalidOperationException`
translation in each routed method, `GetAttribute` delegating to the batch
`GetAttributes` and returning `null` for an absent key, correlation-ID generation,
and `SetAttribute` delegating to `SetAttributes`. These are non-trivial branches
whose failure modes (a thrown exception inside a script) surface to the caller as a
500, so regressions would be silent.
**Recommendation**
Add a `RouteHelperTests` suite using substituted `IInstanceLocator` and
`CommunicationService` (the executor tests already substitute `CommunicationService`):
cover the happy path of each routed method, the unresolved-instance throw, the
`!Success``InvalidOperationException` mapping, and `GetAttribute` returning `null`
for a missing key. This also gives InboundAPI-016 a regression home if the timeout
wiring is added.
**Resolution**
_Unresolved._
+162 -3
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.ManagementService` |
| Design doc | `docs/requirements/Component-ManagementService.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 (1 Deferred — see ManagementService-012) |
| Commit reviewed | `39d737e` |
| Open findings | 4 (1 Deferred — see ManagementService-012) |
## Summary
@@ -28,6 +28,24 @@ instances never disposed) and dead/unused configuration. None of the findings ar
crash-class, but the site-scope gaps are High severity because they are a real
authorization bypass with no workaround.
#### Re-review 2026-05-17 (commit `39d737e`)
All thirteen prior findings remain correctly closed; the source under
`src/ScadaLink.ManagementService` is byte-identical between the previously reviewed state
and `39d737e` (the resolution commits of findings 001013 are folded into the history at or
before `39d737e`). ManagementService-012 was re-checked and its **Deferred** status still
holds: `ManagementEnvelope.Command` is still typed `object`, and the marker-interface fix
still belongs in the Commons module, outside this module's edit scope — nothing has changed
to make it actionable here. This re-review re-ran the full 10-category checklist against the
current sources and surfaced **four new findings**. The dominant theme is the same
site-scope authorization gap that findings 001/002 closed: `HandleQueryDeployments`
(`QueryDeploymentsCommand`) was overlooked by that sweep and still performs no site-scope
enforcement, letting a site-scoped Deployment user read deployment history for any site
(014, High). The remaining three are lower severity: a non-atomic multi-override mutation
that can leave an instance partially modified after an error (015, Medium), raw exception
messages from unexpected faults being returned verbatim to HTTP callers (016, Low), and
`QueryDeploymentsCommand` having no test coverage at all (017, Low).
## Checklist coverage
| # | Category | Examined | Notes |
@@ -551,3 +569,144 @@ infrastructure, so the testable command-parsing/dispatch logic was extracted int
`ParseCommand` helper and covered instead. Tests: `ParseCommand_*` (5),
`SerializeResult_*` (2), `UnknownCommandType_FaultMappedToManagementError`, plus the
pre-existing site-scope and DebugStreamHub suites. `dotnet test` -> 48 passed.
### ManagementService-014 — HandleQueryDeployments bypasses site-scope enforcement
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:306`, `:1174` |
**Description**
`QueryDeploymentsCommand` is gated to the `Deployment` role by `GetRequiredRole`
(`ManagementActor.cs:170``:177`), and the design document's Authorization section states
"Site scoping is enforced for site-scoped Deployment users" and explicitly lists deployments
among the Deployment-role operations. `HandleQueryDeployments` makes no call to
`EnforceSiteScope` / `EnforceSiteScopeForInstance` / `EnforceSiteScopeForIdentifier`: with no
`InstanceId` it returns `IDeploymentManagerRepository.GetAllDeploymentRecordsAsync()` (every
deployment record across all sites), and with an `InstanceId` it returns that instance's
deployment history with no check that the instance's site is within the caller's permitted
set. A site-scoped Deployment user scoped to site A can therefore enumerate deployment
records for instances at site B — instance IDs, `DeployedBy` (operator usernames),
timestamps, deployment status, and `ErrorMessage` content — by issuing `QueryDeployments`
with or without an out-of-scope `InstanceId`. This is the same authorization-bypass class as
the resolved findings 001/002, on a handler that sweep did not cover; it is `DispatchCommand`'s
only `Deployment`-role handler with no scope enforcement.
**Recommendation**
Thread `AuthenticatedUser` into `HandleQueryDeployments` (the dispatch case at line 306
already has `user` in scope). When `cmd.InstanceId` is supplied, call
`EnforceSiteScopeForInstance` before querying. When it is not supplied, filter the returned
`DeploymentRecord` list to the caller's permitted sites — resolve each record's instance to
its `SiteId` (or join through a site-aware repository query) and drop records for sites
outside `PermittedSiteIds`, mirroring the `HandleListInstances` / `HandleListSites` filter
pattern. Add a regression test for a site-scoped user against in-scope and out-of-scope
instances.
**Resolution**
_Unresolved._
### ManagementService-015 — HandleSetInstanceOverrides applies overrides non-atomically
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:647``:659` |
**Description**
`HandleSetInstanceOverrides` iterates `cmd.Overrides` and calls
`InstanceService.SetAttributeOverrideAsync` once per attribute, throwing
`InvalidOperationException` on the first `result.IsSuccess == false`. Each
`SetAttributeOverrideAsync` call persists independently, so if the command supplies five
overrides and the third fails (e.g. an unknown attribute name or a validation error), the
first two overrides are already committed to the configuration database while the caller
receives a `ManagementError`. The instance is left partially mutated in a state the operator
neither sees nor requested, and the per-instance operation lock referenced in the design's
deployment decisions does not protect against this because the partial writes are committed
before the throw. A retry of the same command then re-applies the already-applied overrides.
**Recommendation**
Make the multi-override mutation all-or-nothing: either validate every requested override up
front before applying any, or apply all overrides within a single transaction / unit-of-work
so a mid-batch failure rolls back the earlier writes. If `InstanceService` cannot offer a
batch method, at minimum document the partial-application behaviour on `SetInstanceOverridesCommand`
and have the handler report which overrides were applied before the failure so the caller
can reconcile.
**Resolution**
_Unresolved._
### ManagementService-016 — Unexpected exception messages returned verbatim to HTTP callers
| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:121``:131` |
**Description**
`MapFault` maps any non-`SiteScopeViolationException` fault to
`new ManagementError(correlationId, cause.Message, "COMMAND_FAILED")`, and
`ManagementEndpoints.HandleRequest` returns that `Error` string directly in the HTTP 400
body. For handler-thrown `InvalidOperationException`s carrying a curated `result.Error`
message this is intended and safe. But the same path also surfaces the raw `.Message` of
unanticipated exceptions — a `SqlException` (which can include server/database names and
constraint details), a `DbUpdateException`, an `ArgumentException` from `Enum.Parse` on a
malformed `DataType`/`TriggerType` value, or a `NullReferenceException` — straight to the
external CLI/HTTP client. This is a minor internal-detail disclosure surface: the exception
text is already logged server-side with full context, so the client does not need the raw
message.
**Recommendation**
Distinguish handler-curated failures from unexpected faults. Have handlers throw a dedicated
exception type (e.g. `ManagementCommandException`) for messages that are safe to surface, and
in `MapFault` return that message for the known type while returning a generic
"An internal error occurred (CorrelationId=...)" string for everything else — the operator
can still correlate to the server log via the correlation ID.
**Resolution**
_Unresolved._
### ManagementService-017 — QueryDeploymentsCommand has no test coverage
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.ManagementService.Tests/ManagementActorTests.cs:1` |
**Description**
`QueryDeploymentsCommand` / `HandleQueryDeployments` is exercised by no test in
`ManagementActorTests`. There is no test that it requires the `Deployment` role, no test of
the `InstanceId`-filtered versus unfiltered branches, and — because the handler performs no
site-scope enforcement at all — no test that would have caught finding 014. The deployment
query is one of the operations the design's Authorization section calls out for site
scoping, yet it is the only `Deployment`-role command with neither an authorization test nor
a site-scope test.
**Recommendation**
Add tests for `QueryDeploymentsCommand`: a role test (Design/no-role caller →
`ManagementUnauthorized`), branch coverage for the `InstanceId`-filtered and unfiltered
repository calls, and — once finding 014 is fixed — site-scope tests for a site-scoped
Deployment user against in-scope and out-of-scope deployment records.
**Resolution**
_Unresolved._
+133 -3
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.NotificationService` |
| Design doc | `docs/requirements/Component-NotificationService.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 5 |
## Summary
@@ -30,6 +30,31 @@ plaintext `string` values. Test coverage exercises the happy path and the main
error branches but misses the OAuth2 delivery path, the permanent-classification
fallback in `DeliverAsync`, and concurrency on the token cache.
#### Re-review 2026-05-17 (commit `39d737e`)
Re-reviewed at commit `39d737e` after the NS-001..013 fixes. All twelve prior
findings remain closed (eleven Resolved, NS-013 Deferred) and the fixes hold up
against the current source — error classification is now typed, the SMTP client is
created once and always disconnected/disposed, the S&F `Notification` delivery
handler is registered in `AkkaHostedService`, and the OAuth2 token cache is keyed
per credential. The re-review surfaced **five new findings**. The recurring theme is
**unclassified-exception handling**: the typed `ClassifySmtpError` helper introduced
by NS-002/003 returns `Unknown` for anything that is not a recognised SMTP/socket
failure, but neither `SendAsync` nor `DeliverBufferedAsync` has a catch-all for the
`Unknown` bucket. As a result an OAuth2 token-fetch failure (`HttpRequestException`
from `EnsureSuccessStatusCode`, or `InvalidOperationException` from a malformed
credential string) escapes `SendAsync` as a raw exception to the calling script
(NS-015) and escapes `DeliverBufferedAsync` out of the S&F handler, where the engine
treats any thrown exception as transient and retries a permanently-broken config
forever (NS-014) — the same defect class NS-008 fixed for address parsing, but the
OAuth2 path was never covered. Secondary findings: `AuthenticateAsync` silently
proceeds unauthenticated for an unknown `authType` or empty credentials (NS-016);
`NotificationOptions` is bound from configuration in two places but never read by
any code (NS-017, dead config — NS-007 sourced the timeout/limit from
`SmtpConfiguration` instead); and the lazily-created concurrency limiter is read
outside its lock, is sized once and never resized on redeployment, and is never
disposed (NS-018).
## Checklist coverage
| # | Category | Examined | Notes |
@@ -465,3 +490,108 @@ filed) and the genuinely-missing tests were added:
`AkkaHostedService` handler registration), so no further test is required here.
Module test suite is green at 56 tests.
### NotificationService-014 — OAuth2 token-fetch failure escapes `DeliverBufferedAsync`; a permanently-broken config is retried forever
| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:214-228`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:308-312`, `src/ScadaLink.NotificationService/OAuth2TokenService.cs:56-84` |
**Description**
`DeliverBufferedAsync` is the registered Store-and-Forward delivery handler for `StoreAndForwardCategory.Notification`. Per the S&F handler contract (`StoreAndForwardService.cs:40-53`), a handler must return `true` for success, `false` for permanent failure (park, no further retries), and **throw only for a transient failure** (retry). `DeliverBufferedAsync`'s `try` block (`:214-228`) catches only `SmtpPermanentException`; the comment at `:227` asserts "Transient SMTP errors propagate out of `DeliverAsync`". But `DeliverAsync` can also throw exceptions that `ClassifySmtpError` buckets as `Unknown` — most importantly the OAuth2 path at `:308-312`, where `OAuth2TokenService.GetTokenAsync` throws `HttpRequestException` (from `EnsureSuccessStatusCode` at `OAuth2TokenService.cs:78`) or `InvalidOperationException` (malformed credential triple, or missing `access_token`). None of these is an `SmtpCommandException`/`SmtpProtocolException`/`SocketException`/`TimeoutException`, so they are neither caught nor classified — they propagate straight out of `DeliverBufferedAsync`. The S&F engine treats *any* thrown exception as a transient failure, so a notification whose SMTP config has an invalid client secret (a 401 → `HttpRequestException`, a genuinely **permanent** condition) is retried on every sweep until `MaxRetries` is reached — burning OAuth2 token-endpoint calls and never parking promptly. A malformed credential string (`InvalidOperationException`, never fixable) is likewise retried instead of parked immediately.
**Recommendation**
Add a catch-all to `DeliverBufferedAsync` for exceptions that `ClassifySmtpError` classifies as `Unknown`: log and return `false` (park) for non-retryable causes such as `InvalidOperationException` from credential parsing, and decide deliberately whether an OAuth2 `HttpRequestException` should park or retry (a 5xx token endpoint is transient; a 401 is permanent — inspect `HttpRequestException.StatusCode`). Do not let an unclassified exception leave the handler and be silently reinterpreted as transient.
**Resolution**
_Unresolved._
### NotificationService-015 — Unclassified exceptions (OAuth2 token fetch, non-cancellation OCE) escape `SendAsync` to the calling script
| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:96-148`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:308-312` |
**Description**
`SendAsync`'s `try`/`catch` around `DeliverAsync` has exactly three catch clauses: `SmtpPermanentException` (`:101`), `OperationCanceledException` when cancellation was requested (`:111`), and `Exception` filtered by `IsTransientSmtpError` (`:116`). There is no catch-all for an exception that is none of those. `ClassifySmtpError` returns `SmtpErrorClass.Unknown` for anything it does not recognise, and `IsTransientSmtpError` only matches `Transient`, so an `Unknown`-classified exception falls through all three clauses and escapes `SendAsync` unhandled. Concretely, the OAuth2 branch at `:308-312` calls `OAuth2TokenService.GetTokenAsync`, which throws `HttpRequestException` (token endpoint unreachable / 4xx / 5xx) or `InvalidOperationException` (malformed `tenant:client:secret` triple, or no `access_token` in the response). Both reach the calling script (an instance/alarm/shared script) as a raw exception of a type the `INotificationDeliveryService` contract never advertises — the design doc says delivery returns either success, a buffered transient result, or a permanent `NotificationResult` error. `Notify.To().Send()` against an OAuth2 config with a bad secret therefore crashes the script instead of returning a clean `NotificationResult(false, ...)`. This is the same defect class NS-008 resolved for malformed addresses, but NS-008's `ValidateAddresses` fix never covered the OAuth2 token path. (A non-cancellation `OperationCanceledException` — e.g. an internal MailKit timeout surfaced as OCE while the caller's token is *not* cancelled — would escape the same way.)
**Recommendation**
Add a final `catch (Exception ex)` to `SendAsync` that converts any otherwise-unhandled exception into a permanent `NotificationResult(false, ...)` (credential-scrubbed, consistent with NS-009), and log it. Treating an unclassified failure as permanent is the safe default — it returns control to the script rather than crashing it; an OAuth2 transient (token endpoint 5xx) being reported as permanent is an acceptable trade against an unhandled crash, and can be refined by inspecting `HttpRequestException.StatusCode` if transient buffering of token failures is wanted.
**Resolution**
_Unresolved._
### NotificationService-016 — `AuthenticateAsync` silently sends unauthenticated for an unknown auth type or empty credentials
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:46-67` |
**Description**
`MailKitSmtpClientWrapper.AuthenticateAsync` returns without authenticating in two silent paths. (1) `if (string.IsNullOrEmpty(credentials)) return;` (`:48-49`) — if an SMTP config has `AuthType` "basic" or "oauth2" but a null/empty `Credentials` value (a misconfigured or partially-deployed row), the method returns and `DeliverAsync` proceeds straight to `SendAsync` on an *unauthenticated* connection. (2) The `switch (authType.ToLowerInvariant())` (`:51-66`) has cases only for `"basic"` and `"oauth2"` and **no `default`** — any other `AuthType` value (a typo, a future "ntlm", an empty string) falls through with no authentication attempted. Additionally, in the `"basic"` case, `credentials.Split(':', 2)` that yields fewer than 2 parts (a credential string with no colon) is silently skipped (`:55-58`). In every case the connection then attempts to send mail unauthenticated: against a relay that does not require auth this masks a misconfiguration; against a server that does require auth it fails later with a less obvious error, and at worst an unauthenticated send succeeds where it should not. Authentication being skipped should never be silent.
**Recommendation**
Make missing/unparseable credentials and an unrecognised `AuthType` hard errors: throw a typed exception (e.g. a permanent configuration exception so `SendAsync`/`DeliverBufferedAsync` surface it as a clean failure / park) rather than returning. Add a `default:` arm to the `switch` that throws `ArgumentException` naming the unsupported auth type, and reject a "basic" credential string that does not split into exactly two parts.
**Resolution**
_Unresolved._
### NotificationService-017 — `NotificationOptions` is bound from configuration but never read (dead config)
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationOptions.cs:1-15`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:10-11`, `src/ScadaLink.Host/SiteServiceRegistration.cs:70` |
**Description**
`NotificationOptions` (with `ConnectionTimeoutSeconds` and `MaxConcurrentConnections`) is bound from the `ScadaLink:Notification` configuration section in two places — `ServiceCollectionExtensions.AddNotificationService` (`AddOptions<NotificationOptions>().BindConfiguration(...)`) and again in `Host/SiteServiceRegistration.cs:70` (`services.Configure<NotificationOptions>(...)`). However, a repo-wide search shows no code ever injects `IOptions<NotificationOptions>` or otherwise reads either property. When NS-007 enforced the connection timeout and concurrency limit, it sourced both values from the per-site `SmtpConfiguration` entity, not from `NotificationOptions` — so this options class is now entirely dead configuration. Its XML doc still claims it "provides fallback defaults and operational limits", which is misleading: nothing falls back to it. The double binding is also redundant. Dead, falsely-documented configuration invites an operator to set `ScadaLink:Notification:ConnectionTimeoutSeconds` and expect it to take effect, when it has no effect at all.
**Recommendation**
Either delete `NotificationOptions` and both of its registrations, or genuinely wire it in as the documented fallback (e.g. `DeliverAsync` uses `NotificationOptions` values when the `SmtpConfiguration` field is non-positive). If kept, remove the duplicate `Configure` call and correct the XML doc. The NS-007 resolution note already states the `NotificationOptions` fields "remain as operational fallback defaults" — that intent should be implemented or the class removed.
**Resolution**
_Unresolved._
### NotificationService-018 — Concurrency limiter: lock-free read of a non-volatile field, never resized on redeployment, never disposed
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:237-255` |
**Description**
Three minor issues with the NS-007 concurrency limiter. (1) `GetConcurrencyLimiter` reads the plain (non-`volatile`) field `_concurrencyLimiter` at `:242` *outside* `_limiterLock` as a fast path. The field is only written inside the lock (`:252`), so under the .NET memory model a concurrent thread is not guaranteed to observe the fully-constructed `SemaphoreSlim` via the lock-free read. In practice on x86/x64 this is benign and the double-checked write inside the lock is correct, but the unsynchronized read of a reference set under a lock is the classic lazy-init memory-model pitfall — use `Volatile.Read`/`LazyInitializer`/`Lazy<T>`. (2) The limiter is sized once from the *first* `SmtpConfiguration` ever seen and the `??=` at `:252` means it is never re-created. The design doc says one SMTP config is deployed per site, but that config can be **redeployed** with a different `MaxConcurrentConnections`; the new value is silently ignored for the lifetime of the (scoped, but potentially long-lived per request) service — and more importantly the limit is captured per `NotificationDeliveryService` instance, so behaviour depends on which scoped instance happens to create it first. (3) `SemaphoreSlim` implements `IDisposable`; `_concurrencyLimiter` is never disposed and `NotificationDeliveryService` does not implement `IDisposable`. A scoped service leaking one undisposed `SemaphoreSlim` per scope is a slow handle leak.
**Recommendation**
Replace the hand-rolled double-checked init with `Lazy<SemaphoreSlim>` or `LazyInitializer.EnsureInitialized`, or at minimum make the field access `Volatile.Read`/`Volatile.Write`. Consider hoisting the limiter to a singleton keyed by site (the limit is a per-site invariant, not a per-scope one) so redeployment with a changed limit and disposal are handled in one place; if it stays on the scoped service, implement `IDisposable` to dispose it.
**Resolution**
_Unresolved._
+98 -29
View File
@@ -40,34 +40,34 @@ module file and counted in **Total**.
| Severity | Open findings |
|----------|---------------|
| Critical | 0 |
| High | 0 |
| Medium | 0 |
| Low | 0 |
| **Total** | **0** |
| High | 8 |
| Medium | 26 |
| Low | 32 |
| **Total** | **66** |
## Module Status
| Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total |
|--------|---------------|--------|----------------|------|-------|
| [CLI](CLI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
| [CentralUI](CentralUI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 19 |
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 8 |
| [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 12 |
| [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 12 |
| [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
| [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
| [NotificationService](NotificationService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
| [Security](Security/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 |
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
| [CLI](CLI/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/2 | 3 | 16 |
| [CentralUI](CentralUI/findings.md) | 2026-05-16 | `9c60592` | 0/1/2/3 | 6 | 25 |
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/1 | 2 | 10 |
| [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/2 | 2 | 14 |
| [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/2 | 4 | 15 |
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/1 | 3 | 14 |
| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-16 | `9c60592` | 0/1/2/1 | 4 | 17 |
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 |
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 |
| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/3 | 4 | 16 |
| [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/3 | 4 | 15 |
| [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/3/1 | 4 | 17 |
| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/2 | 4 | 17 |
| [NotificationService](NotificationService/findings.md) | 2026-05-16 | `9c60592` | 0/2/1/2 | 5 | 18 |
| [Security](Security/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/2 | 4 | 15 |
| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/2 | 3 | 14 |
| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/2 | 3 | 19 |
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/1 | 3 | 17 |
| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/0 | 2 | 16 |
## Pending Findings
@@ -80,14 +80,83 @@ description, location, recommendation — lives in the module's `findings.md`.
_None open._
### High (0)
### High (8)
_None open._
| ID | Module | Title |
|----|--------|-------|
| CentralUI-020 | [CentralUI](CentralUI/findings.md) | Idle-session redirect never fires: `SessionExpiry` polls a frozen auth-state snapshot |
| Communication-012 | [Communication](Communication/findings.md) | gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover |
| DataConnectionLayer-014 | [DataConnectionLayer](DataConnectionLayer/findings.md) | DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger |
| DeploymentManager-015 | [DeploymentManager](DeploymentManager/findings.md) | Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates |
| ExternalSystemGateway-015 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `MaxRetries == 0` is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim |
| ManagementService-014 | [ManagementService](ManagementService/findings.md) | HandleQueryDeployments bypasses site-scope enforcement |
| NotificationService-014 | [NotificationService](NotificationService/findings.md) | OAuth2 token-fetch failure escapes `DeliverBufferedAsync`; a permanently-broken config is retried forever |
| NotificationService-015 | [NotificationService](NotificationService/findings.md) | Unclassified exceptions (OAuth2 token fetch, non-cancellation OCE) escape `SendAsync` to the calling script |
### Medium (0)
### Medium (26)
_None open._
| ID | Module | Title |
|----|--------|-------|
| CLI-014 | [CLI](CLI/findings.md) | `update` commands require "core" fields, making partial updates impossible |
| CentralUI-021 | [CentralUI](CentralUI/findings.md) | `DebugView` stream callback mutates `Dictionary` off the render thread |
| CentralUI-022 | [CentralUI](CentralUI/findings.md) | `Deployments` push handler fires `InvokeAsync` with no disposal guard |
| ClusterInfrastructure-009 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | `DownIfAlone` is an inert configuration knob — never consumed by the HOCON builder |
| Communication-013 | [Communication](Communication/findings.md) | Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller |
| ConfigurationDatabase-012 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Inbound-API `ApiKey.KeyValue` bearer credential stored in plaintext |
| ConfigurationDatabase-013 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Secret-column encryption silently falls back to an ephemeral (throwaway) key |
| DataConnectionLayer-015 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Initial-connect failures never trigger failover; an unreachable primary at startup never tries the backup |
| DataConnectionLayer-016 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleSubscribeCompleted` reports `SubscribeTagsResponse` success even on a connection-level subscribe failure |
| DeploymentManager-016 | [DeploymentManager](DeploymentManager/findings.md) | Reconciled prior record keeps its stale `RevisionHash` |
| ExternalSystemGateway-016 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `ConfigureHttpClientDefaults` applies the ESG connection cap to every `HttpClient` in the host process |
| HealthMonitoring-015 | [HealthMonitoring](HealthMonitoring/findings.md) | Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt` |
| Host-012 | [Host](Host/findings.md) | `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read |
| InboundAPI-014 | [InboundAPI](InboundAPI/findings.md) | `ReturnDefinition` is loaded but never used; script return value is unshaped/unvalidated |
| InboundAPI-015 | [InboundAPI](InboundAPI/findings.md) | `ForbiddenApiChecker` is purely textual and is bypassable via reflection reachable without a forbidden namespace token |
| InboundAPI-016 | [InboundAPI](InboundAPI/findings.md) | Routed `Route.To().Call()` invocations are not bound by the method timeout |
| ManagementService-015 | [ManagementService](ManagementService/findings.md) | HandleSetInstanceOverrides applies overrides non-atomically |
| NotificationService-016 | [NotificationService](NotificationService/findings.md) | `AuthenticateAsync` silently sends unauthenticated for an unknown auth type or empty credentials |
| Security-012 | [Security](Security/findings.md) | Partial LDAP failure during login yields a roleless authenticated session |
| Security-014 | [Security](Security/findings.md) | `RefreshToken` re-issues a token without checking the idle timeout |
| SiteEventLogging-012 | [SiteEventLogging](SiteEventLogging/findings.md) | Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted |
| SiteRuntime-017 | [SiteRuntime](SiteRuntime/findings.md) | Instance Actor's live `_attributes` dictionary is shared by reference into child actor constructors |
| StoreAndForward-015 | [StoreAndForward](StoreAndForward/findings.md) | `EnqueueAsync`'s public contract never documents that `maxRetries == 0` means "retry forever" |
| StoreAndForward-016 | [StoreAndForward](StoreAndForward/findings.md) | Operator-initiated parked-message retry and discard are not replicated to the standby |
| TemplateEngine-015 | [TemplateEngine](TemplateEngine/findings.md) | `RenameCompositionAsync` does not cascade-rename nested derived templates |
| TemplateEngine-016 | [TemplateEngine](TemplateEngine/findings.md) | Composed-script `ScriptScope.ParentPath` is always empty, breaking `Parent.X` resolution for nested modules |
### Low (0)
### Low (32)
_None open._
| ID | Module | Title |
|----|--------|-------|
| CLI-015 | [CLI](CLI/findings.md) | `Component-CLI.md` command surface has drifted again in two places |
| CLI-016 | [CLI](CLI/findings.md) | `WriteAsTable` derives columns from the first array element only |
| CentralUI-023 | [CentralUI](CentralUI/findings.md) | Residual bare `catch {}` blocks swallow JS interop errors |
| CentralUI-024 | [CentralUI](CentralUI/findings.md) | Claim lookups use magic strings instead of `JwtTokenService` constants |
| CentralUI-025 | [CentralUI](CentralUI/findings.md) | `SessionExpiry` polling/redirect path has no test coverage |
| ClusterInfrastructure-010 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Validator does not enforce `DownIfAlone = true` despite the design doc requiring it |
| Commons-013 | [Commons](Commons/findings.md) | `DynamicJsonElement.TryGetIndex` rejects non-`int` index values |
| Commons-014 | [Commons](Commons/findings.md) | `OpcUaEndpointConfigSerializer.Deserialize` can mislabel a corrupt typed row as `Legacy` |
| Communication-014 | [Communication](Communication/findings.md) | Untrusted gRPC `correlation_id` flows directly into an Akka actor name |
| Communication-015 | [Communication](Communication/findings.md) | No test exercises the real gRPC client factory across a node flip |
| ConfigurationDatabase-014 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Redundant, inconsistent cast on one `HasConversion` call |
| DataConnectionLayer-017 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `WriteBatchAsync` aborts the whole batch on a mid-batch disconnect |
| DeploymentManager-017 | [DeploymentManager](DeploymentManager/findings.md) | `GetDeploymentStatusAsync` XML doc describes behaviour it does not implement |
| ExternalSystemGateway-017 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `BuildUrl` appends a bare trailing `?` when a GET method's parameters are all null |
| HealthMonitoring-013 | [HealthMonitoring](HealthMonitoring/findings.md) | Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout` |
| HealthMonitoring-014 | [HealthMonitoring](HealthMonitoring/findings.md) | `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service |
| HealthMonitoring-016 | [HealthMonitoring](HealthMonitoring/findings.md) | `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider` |
| Host-013 | [Host](Host/findings.md) | `:F0` rounding of cluster timing values silently degrades sub-second configuration |
| Host-014 | [Host](Host/findings.md) | Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8) |
| Host-015 | [Host](Host/findings.md) | `StartupRetry` retries on every exception type, including permanent failures |
| InboundAPI-017 | [InboundAPI](InboundAPI/findings.md) | `RouteHelper` / `RouteTarget` has no test coverage |
| ManagementService-016 | [ManagementService](ManagementService/findings.md) | Unexpected exception messages returned verbatim to HTTP callers |
| ManagementService-017 | [ManagementService](ManagementService/findings.md) | QueryDeploymentsCommand has no test coverage |
| NotificationService-017 | [NotificationService](NotificationService/findings.md) | `NotificationOptions` is bound from configuration but never read (dead config) |
| NotificationService-018 | [NotificationService](NotificationService/findings.md) | Concurrency limiter: lock-free read of a non-volatile field, never resized on redeployment, never disposed |
| Security-013 | [Security](Security/findings.md) | `ExtractFirstRdnValue` mis-parses group DNs containing escaped commas |
| Security-015 | [Security](Security/findings.md) | Username is not trimmed before use in the LDAP filter, fallback DN, and JWT claims |
| SiteEventLogging-013 | [SiteEventLogging](SiteEventLogging/findings.md) | Keyword search does not escape SQL `LIKE` wildcards in user input |
| SiteEventLogging-014 | [SiteEventLogging](SiteEventLogging/findings.md) | Initial purge runs synchronously on the host startup thread |
| SiteRuntime-018 | [SiteRuntime](SiteRuntime/findings.md) | `ScriptExecutionActor` XML doc still claims a "dedicated blocking I/O dispatcher" |
| SiteRuntime-019 | [SiteRuntime](SiteRuntime/findings.md) | Dead `DisableInstanceCommand` / `EnableInstanceCommand` handlers in `InstanceActor` |
| StoreAndForward-017 | [StoreAndForward](StoreAndForward/findings.md) | Retry/Discard activity-log entries hard-code the `ExternalSystem` category |
+161 -3
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.Security` |
| Design doc | `docs/requirements/Component-Security.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 (1 deferred — Security-008) |
| Commit reviewed | `39d737e` |
| Open findings | 4 (1 deferred — Security-008) |
## Summary
@@ -28,6 +28,26 @@ on every refresh, weakening the documented 30-minute idle policy. None of these
crash/data-loss bugs, but the TLS, cookie, and key-validation items are security
defects that should be fixed before any production deployment.
#### Re-review 2026-05-17 (commit `39d737e`)
Re-reviewed the module after the batch-resolution of Security-001..007, 009, 010, 011.
Those fixes were verified in place and remain Resolved: the `LdapTransport` enum makes
StartTLS reachable, the auth cookie is `SecurePolicy.Always` with a sliding window, the
JWT signing key is length-validated at construction, issuer/audience are bound and
checked, the DN-injection fallback is RFC 4514-escaped, and the idle-timeout anchor is
carried across refreshes. Security-008 (N+1 scope-rule query) remains correctly
**Deferred** — `ISecurityRepository` still exposes no per-set scope-rule query, so the
N+1 cannot be removed from within `RoleMapper.cs` alone; the deferral condition is
unchanged. This pass surfaced **4 new findings** (Security-012..015): two Medium and
two Low. The most notable is that a *partial* LDAP outage during login (bind succeeds,
group search fails) silently returns an authenticated session with **zero roles**
instead of failing the login as the design's LDAP-failure rule requires
(Security-012); and that `RefreshToken` re-issues a token without ever checking
`IsIdleTimedOut`, so an idle-expired session can be renewed indefinitely if the caller
omits the separate idle check (Security-014). The two Low findings concern fragile
DN parsing of group names containing escaped commas and an un-trimmed username flowing
into the LDAP filter, fallback DN, and JWT claims.
## Checklist coverage
| # | Category | Examined | Notes |
@@ -456,3 +476,141 @@ suite. Remaining gaps closed here: renamed the test file `UnitTest1.cs` →
LDAP-timeout coverage (Security-009) plus extra no-service-account / DN-path edge cases
(`BuildFallbackUserDn_NoSearchBase_ReturnsBareRdn`, `EscapeLdapDn_LeadingHash_IsEscaped`,
`EscapeLdapDn_NullOrEmpty_ReturnedUnchanged`). Full module suite: 54 tests, all green.
### Security-012 — Partial LDAP failure during login yields a roleless authenticated session
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Security/LdapAuthService.cs:78-118` |
**Description**
In `AuthenticateAsync`, after the user bind succeeds, the group/attribute `Search` is
wrapped in a `try`/`catch (LdapException)` that logs a warning and continues — the
comment states "Auth succeeded even if attribute lookup failed". The method then returns
`new LdapAuthResult(true, displayName, username, groups, null)` with an **empty
`groups`** list. Because `RoleMapper.MapGroupsToRolesAsync` derives all roles from that
group list, a login during a *partial* LDAP outage (bind reachable, search/group server
unavailable, or a transient `LdapException` mid-search) produces a fully authenticated
session with **zero roles** — the user is logged in but silently stripped of all
permissions. The design's LDAP Connection Failure rule states new logins must **fail**
when LDAP is unavailable; this path instead admits the user. The caller cannot
distinguish "user genuinely belongs to no mapped groups" from "the group query failed",
so it cannot make the documented fail-the-login decision. The `while` loop's inner
`catch (LdapException) { break; }` (line 107-111) compounds this: a real error mid-enumeration
is indistinguishable from end-of-results and silently truncates the group list.
**Recommendation**
Treat a failed group/attribute lookup on the *initial login* as an authentication
failure: return `Success = false` with a "directory unavailable, try again" message, or
add an explicit flag on `LdapAuthResult` (e.g. `GroupLookupSucceeded`) so the caller can
apply the design's fail-the-login rule. The inner `while`-loop `catch` should not treat
an arbitrary `LdapException` as a benign end-of-results sentinel — only the specific
"no more results" condition should break the loop; any other exception should propagate
or be surfaced.
**Resolution**
_Unresolved._
### Security-013 — `ExtractFirstRdnValue` mis-parses group DNs containing escaped commas
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Security/LdapAuthService.cs:258-269` |
**Description**
`ExtractFirstRdnValue` extracts a group name from a `memberOf` DN by taking the substring
between the first `=` and the first `,`. RFC 4514 permits a `,` inside an RDN value when
it is backslash-escaped — e.g. `cn=Acme\, Inc Operators,ou=groups,dc=example,dc=com`. The
method splits on the *escaped* comma and returns `Acme\` as the group name. `RoleMapper`
then matches that mangled name verbatim against configured `LdapGroupMapping.LdapGroupName`
values, so a group whose CN legitimately contains a comma silently fails to map to its
role — the user loses that role with no error. The same naive parsing also does not strip
the trailing backslash escape.
**Recommendation**
Parse the first RDN with an escape-aware scan: ignore a `,` preceded by an unescaped
`\`, and unescape RFC 4514 sequences in the extracted value. Alternatively use the LDAP
library's DN-parsing API (`LdapDN` / RDN accessors) rather than hand-rolled `IndexOf`.
**Resolution**
_Unresolved._
### Security-014 — `RefreshToken` re-issues a token without checking the idle timeout
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Security/JwtTokenService.cs:156-169` |
**Description**
`RefreshToken` carries the existing `LastActivity` claim forward (correct, per the
Security-007 fix) and unconditionally issues a fresh token with a new 15-minute expiry.
It never calls `IsIdleTimedOut`. The documented 30-minute idle policy therefore depends
entirely on the CentralUI request pipeline calling `IsIdleTimedOut` *before* calling
`RefreshToken` — two independent, order-dependent checks with no enforcement that they
are used together. If a caller refreshes without first checking idle state (an easy
mistake, and there is no compiler or API signal preventing it), an idle-expired session
is silently renewed: each refresh resets the JWT expiry while `LastActivity` keeps
ageing, so the session can be kept alive indefinitely by background refreshes even
though the user has been idle for hours. `RefreshToken` is the natural place to enforce
the invariant but provides no safety net.
**Recommendation**
Have `RefreshToken` itself reject a principal that is already past the idle timeout —
return `null` (the same "cannot refresh" signal it already uses for missing claims) when
`IsIdleTimedOut(currentPrincipal)` is true — so the idle policy holds regardless of
caller discipline. Document that an idle-expired principal cannot be refreshed and add a
regression test covering refresh of a 31-minute-idle token.
**Resolution**
_Unresolved._
### Security-015 — Username is not trimmed before use in the LDAP filter, fallback DN, and JWT claims
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Security/LdapAuthService.cs:20-21`, `:80`, `:122`, `:169`, `:193` |
**Description**
`AuthenticateAsync` rejects null/whitespace-only usernames via `IsNullOrWhiteSpace`, but
an otherwise-valid username with leading or trailing whitespace (`" alice"`, copy-paste
artefacts, mobile keyboards) is passed through verbatim. It flows into the LDAP search
filter `({attr}={EscapeLdapFilter(username)})`, into the fallback bind DN via
`BuildFallbackUserDn`, and — most consequentially — into the returned `LdapAuthResult`
and from there into the JWT `Username` claim. Most LDAP directories ignore surrounding
whitespace on a search term, so `" alice"` and `"alice"` may both authenticate but yield
JWT `Username` claims that differ by whitespace. Any downstream identity comparison
(audit-log `who`, site-scope lookups keyed by username, session de-duplication) then
sees two distinct identities for the same person.
**Recommendation**
Trim the username once at the top of `AuthenticateAsync` (after the `IsNullOrWhiteSpace`
guard) and use the trimmed value consistently for the filter, the DN, and the
`LdapAuthResult`. This normalises the identity that ends up in the JWT claim and audit
trail.
**Resolution**
_Unresolved._
+141 -5
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.SiteEventLogging` |
| Design doc | `docs/requirements/Component-SiteEventLogging.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 3 |
## Summary
@@ -28,16 +28,33 @@ cluster-singleton placement of the handler actor (which can pin to the standby
node), missing indexes for common query filters, retention/cap purge not enforcing
the requirement strictly, and several documentation/maintainability issues.
#### Re-review 2026-05-17 (commit `39d737e`)
Re-reviewed the module at commit `39d737e`. All eleven prior findings remain closed
(SiteEventLogging-001..003, 005..011 Resolved; 004 Won't Fix) and the resolutions
hold up under inspection — the background writer, lock-guarded `WithConnection`,
`auto_vacuum = INCREMENTAL` plus logical-size measurement, the severity index, and
the concrete-recorder DI wiring are all present and correct at this commit. The
module source is byte-identical between `39d737e` and current `HEAD`, so this review
reflects the live code. Three new findings were recorded, all low-to-medium and none
regressions of prior fixes. The most notable (SiteEventLogging-012) is a correctness
gap left by the SiteEventLogging-005 background-writer rework: when an event cannot
be persisted because the logger has been disposed, the returned `Task` is completed
*successfully* rather than faulted, so an `await`-ing caller is told a dropped audit
event was written. The other two are minor: unescaped SQL `LIKE` wildcards in the
keyword-search filter (SiteEventLogging-013) and the initial purge running
synchronously on the host startup thread (SiteEventLogging-014).
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). |
| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). Re-review: dropped events report success (-012); `LIKE` wildcards unescaped in keyword search (-013). |
| 2 | Akka.NET conventions | ☑ | Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). `Ask` boundary is appropriate. |
| 3 | Concurrency & thread safety | ☑ | Shared `SqliteConnection` used by purge/query without the write lock (-003). |
| 4 | Error handling & resilience | ☑ | `LogEventAsync` swallows write failures silently into a log line only (-008); purge catches broadly. |
| 5 | Security | ☑ | Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding. |
| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). |
| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). Re-review: initial purge blocks host startup thread (-014). |
| 7 | Design-document adherence | ☑ | Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002). |
| 8 | Code organization & conventions | ☑ | Concrete-type downcast of `ISiteEventLogger` (-007); `internal Connection` leaks DB handle (-007). |
| 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
@@ -529,3 +546,122 @@ explanatory note added to `AddSiteEventLogging` pointing readers to where the ac
is actually registered. Documentation/dead-code change only; no regression test was
added — the change is a method removal verified by the compiler (no callers) and the
full module suite still passing.
### SiteEventLogging-012 — Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:160-166,193-197` |
**Description**
`LogEventAsync` returns a `Task` that, per the interface XML doc (corrected under
SiteEventLogging-009), "completes once the event is durably persisted and faults if
the write fails, so callers that `await` it observe success or failure." Two paths
break that contract by signalling **success** for an event that was never written:
1. In `LogEventAsync`, if `_writeQueue.Writer.TryWrite(pending)` fails (the channel
has been completed because the logger was disposed), the code calls
`pending.Completion.TrySetResult()` — completing the `Task` successfully — even
though the comment immediately above acknowledges "there is nowhere to persist the
event."
2. In `ProcessWriteQueueAsync`, `WithConnection` returns `false` when the logger has
been disposed mid-drain. The code does not inspect the returned `written` flag and
unconditionally calls `pending.Completion.TrySetResult()`, again reporting success
for an event the comment admits "simply cannot be persisted."
The event log is the site's diagnostic audit trail. A caller that `await`s
`LogEventAsync` to confirm a critical event (deployment applied, alarm activated) was
recorded will observe a *successful* completion for an event that was silently
dropped. This is the same class of defect SiteEventLogging-008 fixed for write
*errors* — but the disposed-drop path was left reporting false success. The window
is the disposal/shutdown interval, during which shutdown-related events (graceful
singleton handover, instance disable) are exactly the events most likely to be
enqueued and lost.
**Recommendation**
For both paths, fault the `Task` (or complete it with a sentinel failure) instead of
`TrySetResult()` — e.g. `pending.Completion.TrySetException(new ObjectDisposedException(...))`
— so an `await`-ing caller can distinguish a dropped event from a persisted one.
Inspect the `written` flag returned by `WithConnection` in `ProcessWriteQueueAsync`
and only call `TrySetResult()` when `written` is `true`. Update the XML doc if a
deliberate "drop silently on shutdown" semantics is chosen instead.
**Resolution**
_Unresolved._
### SiteEventLogging-013 — Keyword search does not escape SQL `LIKE` wildcards in user input
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:79-83` |
**Description**
The keyword-search filter builds the `LIKE` pattern as `$"%{request.KeywordFilter}%"`
and binds it as a parameter. Parameterisation correctly prevents SQL injection, but
it does **not** neutralise the `LIKE` metacharacters `%` and `_` inside the
user-supplied keyword. A search for a literal `_` (common in event sources and
identifiers such as `store_and_forward`, `PLC_1`, or instance IDs) is interpreted as
"match any single character", and a `%` matches any run of characters. The design
calls keyword search "free-text search on message and source fields ... Useful for
finding events by script name, alarm name, or error message" — users will reasonably
expect a literal substring match, so a query for `store_and_forward` silently returns
events containing `storeXandYforward` and similar false positives. There is no way
for the caller to search for a literal underscore or percent.
**Recommendation**
Escape `%`, `_`, and the escape character itself in `request.KeywordFilter` before
wrapping it in `%...%`, and append an `ESCAPE` clause to the `LIKE` expression
(e.g. `... LIKE $keyword ESCAPE '\'`). Alternatively document that the keyword field
accepts `LIKE` wildcard syntax, but a literal-substring match is the behaviour the
design implies.
**Resolution**
_Unresolved._
### SiteEventLogging-014 — Initial purge runs synchronously on the host startup thread
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:34-48` |
**Description**
`EventLogPurgeService.ExecuteAsync` calls `RunPurge()` (a fully synchronous method
that runs `PurgeByRetention` and `PurgeByStorageCap`) *before* the first `await`
(`await timer.WaitForNextTickAsync(...)`). A `BackgroundService`'s `ExecuteAsync` is
invoked from `StartAsync`, and the host's startup pipeline does not proceed past a
`BackgroundService` until its `ExecuteAsync` yields at the first real `await`. Because
`RunPurge()` precedes any `await`, the entire initial purge — including a cap-purge
that deletes rows in 1000-row batches and runs `PRAGMA incremental_vacuum` until a
near-1 GB database is back under the cap — executes inline on the startup thread,
blocking host startup (and therefore the `/health/ready` gate) for as long as the
purge takes. On a site that has accumulated a large log this can be a multi-second
stall during every node start/failover. The class doc states the service "runs on a
background thread and does not block event recording" — the startup-thread block is
inconsistent with that intent.
**Recommendation**
Yield before the initial purge so it runs on the background scheduler rather than the
startup thread — e.g. `await Task.Yield();` as the first statement of `ExecuteAsync`,
or move the initial `RunPurge()` to after the first `await timer.WaitForNextTickAsync`
(accepting a one-interval delay), or offload it with `await Task.Run(RunPurge, stoppingToken)`.
**Resolution**
_Unresolved._
+144 -3
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.SiteRuntime` |
| Design doc | `docs/requirements/Component-SiteRuntime.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 3 |
## Summary
@@ -28,6 +28,24 @@ in a comment but ships it anyway). Test coverage exists for the coordinator acto
persistence and scripting, but the short-lived execution actors, the replication
actor, and the repositories are untested.
#### Re-review 2026-05-17 (commit `39d737e`)
The module was re-reviewed at commit `39d737e`. No source under
`src/ScadaLink.SiteRuntime` has changed since the previous review at `9c60592`
(the only intervening commits are code-review documentation updates), so all of
SiteRuntime-001..013, 015, 016 remain Resolved and SiteRuntime-014 remains
Deferred — its Deferred justification (a trigger-evaluation concurrency design
decision is required before either recommended fix can land in-module) still
holds verbatim against the unchanged `ScriptActor`/`AlarmActor` source. The
re-review nonetheless worked through all 10 checklist categories afresh and
surfaced three new findings that the prior pass did not record: a cross-thread
`Dictionary` enumeration race when the Instance Actor's live `_attributes`
dictionary is handed by reference into child `ScriptActor`/`AlarmActor`
constructors (SiteRuntime-017, Medium); a stale `ScriptExecutionActor` XML doc
that still claims a "dedicated blocking I/O dispatcher" (SiteRuntime-018, Low);
and two dead lifecycle handlers in `InstanceActor` that the Deployment Manager
never routes to (SiteRuntime-019, Low). Open findings: 3.
## Checklist coverage
| # | Category | Examined | Notes |
@@ -733,3 +751,126 @@ harness is a larger test-infrastructure task tracked separately and out of scope
Low-severity coverage finding; the highest-value untested paths the finding called out
(script timeout/failure/reply/self-stop) are now covered. Full module suite: 192 tests
green.
### SiteRuntime-017 — Instance Actor's live `_attributes` dictionary is shared by reference into child actor constructors
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:625`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:675`, `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:83`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:93` |
**Description**
`InstanceActor.CreateChildActors` passes the Instance Actor's own mutable
`_attributes` field (a plain `Dictionary<string, object?>`) by reference into the
`Props.Create(...)` factory for every `ScriptActor` and `AlarmActor` (as the
`initialAttributes` constructor argument). Each child constructor then iterates
that dictionary to seed its `_attributeSnapshot`:
```csharp
if (initialAttributes != null)
foreach (var kvp in initialAttributes)
_attributeSnapshot[kvp.Key] = kvp.Value;
```
`Context.ActorOf` returns immediately; the child actor's constructor runs later on
the *child's* mailbox thread. Meanwhile the Instance Actor's `PreStart` returns and
the Instance Actor begins processing its mailbox — `HandleTagValueUpdate` and
`HandleAttributeValueChanged` both mutate `_attributes` (`_attributes[...] = ...`).
A DCL tag update that arrives before a child has finished its constructor copy
therefore mutates the dictionary on the Instance Actor thread while the child
thread is enumerating it. `Dictionary<,>` is explicitly not safe for concurrent
read/write: the enumeration can throw `InvalidOperationException` ("collection was
modified") — which surfaces as an `ActorInitializationException` and, under the
Instance Actor's `SupervisorStrategy`, **stops** the child (the strategy returns
`Stop` for `ActorInitializationException`). The script or alarm is then silently
absent for the life of the instance. A torn read of an entry is also possible. The
window is small but deterministically reachable on a busy site at startup/failover
— exactly the staggered-startup scenario the design is most concerned about.
**Recommendation**
Do not share the live dictionary. Snapshot it on the Instance Actor thread before
constructing the child — e.g. pass `new Dictionary<string, object?>(_attributes)`
(or an immutable copy) into each `Props.Create`. The copy is made on the Instance
Actor thread inside `CreateChildActors`, so it is race-free, and each child gets a
private dictionary to seed from.
**Resolution**
_Unresolved._
### SiteRuntime-018 — `ScriptExecutionActor` XML doc still claims a "dedicated blocking I/O dispatcher"
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptExecutionActor.cs:17` |
**Description**
The class-level XML summary on `ScriptExecutionActor` states "Runs on a dedicated
blocking I/O dispatcher." That is not what the code does. SiteRuntime-009 was
resolved by introducing `ScriptExecutionScheduler` (a bounded dedicated
`TaskScheduler`); the *actor itself and its mailbox* run on the **default** Akka
dispatcher, and only the script body runs on the scheduler's threads via
`Task.Factory.StartNew(..., scheduler)`. The resolution of SiteRuntime-009
explicitly chose the `TaskScheduler` route *instead of* a HOCON dispatcher and
even removed the "in production, configure a dedicated dispatcher" comments
elsewhere — but this stale summary line was missed. A reader is told the actor is
on a dedicated dispatcher when it is not, which is misleading when reasoning about
mailbox throughput and thread-pool pressure. (`AlarmExecutionActor` does not carry
the equivalent claim — its summary only says "Same pattern as ScriptExecutionActor.")
**Recommendation**
Correct the summary to describe the actual model: the actor runs on the default
dispatcher and the script body is dispatched onto the dedicated
`ScriptExecutionScheduler` (SiteRuntime-009). Align the wording with the accurate
comment already present at `ScriptExecutionActor.cs:71-73`.
**Resolution**
_Unresolved._
### SiteRuntime-019 — Dead `DisableInstanceCommand` / `EnableInstanceCommand` handlers in `InstanceActor`
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:106`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:113` |
**Description**
`InstanceActor`'s constructor registers `Receive<DisableInstanceCommand>` and
`Receive<EnableInstanceCommand>` handlers that log and reply with a successful
`InstanceLifecycleResponse`. These handlers are unreachable. The Deployment Manager
is the only sender of those commands, and `DeploymentManagerActor.HandleDisable` /
`HandleEnable` handle the lifecycle entirely themselves — they call
`Context.Stop(actor)` (disable) or `CreateInstanceActor(...)` (enable) directly and
reply to the original sender from the Deployment Manager. Neither command is ever
`Forward`-ed or `Tell`-ed to the Instance Actor. The handlers are dead code, and
they are actively misleading: a maintainer reading `InstanceActor` would reasonably
believe disable/enable is partly an Instance-Actor responsibility, and the no-op
"true" reply implies an instance-side acknowledgement contract that does not exist.
If a future change *did* route these commands here, the disable handler would do
nothing useful (it does not stop children or tear down state — Akka does that when
the parent stops the actor).
**Recommendation**
Remove the two `Receive<...>` registrations and their handler bodies from
`InstanceActor`, since the Deployment Manager owns the disable/enable lifecycle.
If the intent is to keep them for a future instance-side hook, add an XML comment
stating that the Deployment Manager currently handles these and the handlers are a
reserved placeholder — but removal is preferred.
**Resolution**
_Unresolved._
+186 -6
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.StoreAndForward` |
| Design doc | `docs/requirements/Component-StoreAndForward.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 (3 Deferred: 002, 011, 012 — see notes) |
| Commit reviewed | `39d737e` |
| Open findings | 3 (3 Deferred: 002, 011, 012 — see notes) |
## Summary
@@ -30,20 +30,45 @@ status set, and untested critical paths (retry-due timing, replication-from-acti
the actor bridge). None of the findings are blockers for compilation, but the
replication and retry-count issues are functional defects against the design.
#### Re-review 2026-05-17 (commit `39d737e`)
Re-reviewed at commit `39d737e` after the batch-3 fixes. All of findings 001 and
003010, plus 013014, are confirmed `Resolved` against the current source: the
replication wiring (`BufferAsync`/`ReplicateRemove`/`ReplicatePark`), the corrected
retry-count semantics, the conditional `UpdateMessageIfStatusAsync` writes, the
transactioned parked-message reads, the `PipeTo` refactor, the `RaiseActivity`
hardening, the `RetryParkedMessageAsync` `last_attempt_at` reset and the database
directory creation are all present as described. Findings 002, 011 and 012 were
re-verified and remain validly `Deferred` — their preconditions are unchanged (002's
residual no-handler gap, 011's Commons-owned enum, 012's Commons-owned entity placement).
This pass surfaced **three new findings**. StoreAndForward-015 records the
StoreAndForward side of the cross-module `MaxRetries == 0` ambiguity flagged by
ExternalSystemGateway-015: `EnqueueAsync`'s public contract documents `maxRetries` only
as "parked once `MaxRetries` is reached" and never states the `0 = no limit / retry
forever` special case that `RetryMessageAsync` actually enforces, so an ESG caller
passing `0` to mean "never retry" gets the opposite behaviour with no warning from the
S&F API surface. StoreAndForward-016 records that operator-initiated parked-message
retry and discard are not replicated to the standby — only the add/remove/park sweep
paths are — so a failover diverges the standby buffer from the active one.
StoreAndForward-017 records that the Retry/Discard activity-log entries hard-code the
`ExternalSystem` category, mislabelling notification and cached-DB-write messages in
the site event log.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010). |
| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010); Retry/Discard activity log hard-codes the category (017). |
| 2 | Akka.NET conventions | ☑ | `ContinueWith` used instead of `PipeTo`-friendly continuations; default supervision; see 007. |
| 3 | Concurrency & thread safety | ☑ | Sweep guarded by `Interlocked`, but no guard against retry-vs-manage races (005); `OnActivity` event not thread-safe (009). |
| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002). |
| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002); operator retry/discard not replicated to standby (016). |
| 5 | Security | ☑ | No issues found — parameterised SQL throughout; no secrets handled directly; payload JSON treated opaquely. |
| 6 | Performance & resource management | ☑ | New SQLite connection per call; multi-statement operations not wrapped in a transaction (006, 008). |
| 7 | Design-document adherence | ☑ | Replication gap (001); `InFlight` status undocumented/unused (011); "retrying" status from design doc not modelled. |
| 8 | Code organization & conventions | ☑ | `StoreAndForwardMessage` is an entity-like POCO living in the component, not Commons (012). |
| 9 | Testing coverage | ☑ | Retry-due timing, replication-from-active, and `ParkedMessageHandlerActor` are untested (013). |
| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004). |
| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004); `EnqueueAsync` never documents the `maxRetries == 0` = "retry forever" special case (015). |
## Findings
@@ -703,3 +728,158 @@ and bare filenames are skipped). Regression test
`InitializeAsync_FileInMissingDirectory_CreatesDirectory` fails against the pre-fix code;
all six `SiteActorPathTests` now pass. Fixed by the commit whose message references
`StoreAndForward-014`.
### StoreAndForward-015 — `EnqueueAsync`'s public contract never documents that `maxRetries == 0` means "retry forever"
| | |
|--|--|
| Severity | Medium |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:114``:130`, `:285` |
**Description**
The re-review brief asks for the StoreAndForward side of the cross-module ambiguity
recorded as `ExternalSystemGateway-015`. The semantics are split across this module and
its callers, and the StoreAndForward side carries a genuine documentation/API-contract
fault:
- `RetryMessageAsync` parks a message only when `message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries`
(`StoreAndForwardService.cs:285`). When `MaxRetries == 0` the guard is false on every
sweep, so a `0` value means **"no limit — retry forever"**. The
`StoreAndForwardMessage.MaxRetries` XML doc (`StoreAndForwardMessage.cs:31`) does state
`"0 = no limit"`, so the persistence model is internally consistent.
- But `EnqueueAsync` — the *only* public entry point into the engine — exposes a
`maxRetries` parameter (`StoreAndForwardService.cs:128`) with **no parameter
documentation at all**, and its method summary (lines 114122) describes the lifecycle
only as "On max retries → park" / "parked once `MaxRetries` is reached" (see also the
`_deliveryHandlers` field doc, line 5051). Nothing on the public surface tells a
caller that passing `0` flips the meaning from "park immediately / never retry" to
"retry forever". A caller reading only `EnqueueAsync` would reasonably assume `0`
retries means zero retries.
- This is exactly the trap ESG fell into: `ExternalSystemClient.CachedCallAsync` /
`DatabaseGateway.CachedWriteAsync` pass the source entity's `MaxRetries` verbatim,
intending `0` to mean "never retry", and instead get unbounded retry — the
duplicate-delivery / unbounded-buffer-growth hazard the design doc's idempotency note
warns against. The fault is not solely ESG's: the S&F public API silently overloads
`0` with the opposite of its natural reading and does not document it.
The defect is in this module's API contract and documentation, so it is recorded here
in addition to `ExternalSystemGateway-015`. (Whether `0` *should* mean "no limit" or
"no retry" is the cross-module design decision tracked by ESG-015; this finding is
specifically that the StoreAndForward public surface fails to document whichever
meaning is chosen.)
**Recommendation**
Document the `maxRetries` parameter on `EnqueueAsync` explicitly with a `<param>` tag
that states the `0` special case in the same words as `StoreAndForwardMessage.MaxRetries`
(`"0 = no limit — retried on every sweep until delivered, never parked"`), and add the
`0` case to the method summary's lifecycle description. Better still — and consistent
with the resolution of ESG-015 — make the engine reject the ambiguity at the API: accept
a nullable/`enum` retry policy, or treat `0` as an explicit "no retry" (do not buffer, or
park on the first sweep) so the natural reading and the behaviour agree. Either way the
public `EnqueueAsync` contract must state the chosen meaning; today it states nothing.
**Resolution**
_Unresolved._
### StoreAndForward-016 — Operator-initiated parked-message retry and discard are not replicated to the standby
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:339``:362`; `src/ScadaLink.StoreAndForward/ReplicationService.cs:131``:136` |
**Description**
`StoreAndForward-001`'s fix wired replication into the active *delivery* paths:
`BufferAsync` replicates an `Add`, a successful retry replicates a `Remove`, and a park
replicates a `Park`. But the two *operator* paths — `RetryParkedMessageAsync` (line 339)
and `DiscardParkedMessageAsync` (line 353) — change buffer state and never touch
`_replication`:
- `RetryParkedMessageAsync` flips a row from `Parked` back to `Pending` (and clears
`retry_count` / `last_attempt_at`) in the local SQLite only. The standby's copy stays
`Parked`.
- `DiscardParkedMessageAsync` `DELETE`s the row from the local SQLite only. The standby's
copy is left in place, still `Parked`.
The Component design doc ("Persistence") requires the active node to forward "each
buffer operation (add, remove, park)" so that on failover "the new active node has a
near-complete copy of the buffer." An operator retrying a parked message is a buffer
state change; an operator discarding one is a removal. After a failover that follows an
operator action:
1. A **discarded** message reappears on the new active node — it is still `Parked`
there, so it resurfaces in the central UI's parked-message list and an operator must
discard it a second time. For a message deliberately removed (e.g. a known-bad
payload) this is a correctness regression of the operator's intent.
2. A **retried** message is still `Parked` on the new active node, so the operator's
"move it back to the queue" action is silently lost across the failover and the
message is not re-attempted.
`ReplicationOperationType` only models `Add`/`Remove`/`Park` — there is no operation for
"un-park / move back to pending", so even a minimal fix needs either a new operation
type or a re-use of `Add` to overwrite the standby row. This is the same class of defect
as the now-resolved `StoreAndForward-001`, for the operator paths rather than the sweep
paths.
**Recommendation**
Replicate both operator actions. `DiscardParkedMessageAsync` should call
`_replication?.ReplicateRemove(messageId)` after a successful local delete (the existing
`Remove` op already deletes on the standby). For `RetryParkedMessageAsync`, add a
`Requeue`/`Unpark` `ReplicationOperationType` whose `ApplyReplicatedOperationAsync` case
resets the standby row to `Pending` with `retry_count = 0`, or have the method re-load
the updated message and replicate it as an `Add`-style upsert. Add replication tests for
both operator paths (the existing `StoreAndForwardReplicationTests` only cover the sweep
paths).
**Resolution**
_Unresolved._
### StoreAndForward-017 — Retry/Discard activity-log entries hard-code the `ExternalSystem` category
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:344`, `:358` |
**Description**
`RetryParkedMessageAsync` and `DiscardParkedMessageAsync` raise an S&F activity
notification (consumed by Site Event Logging — WP-14) but pass a hard-coded
`StoreAndForwardCategory.ExternalSystem` as the category argument:
```csharp
RaiseActivity("Retry", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} moved back to queue");
RaiseActivity("Discard", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} discarded");
```
Both methods take only a `messageId` and never load the message, so they have no access
to its real category. When an operator retries or discards a parked **Notification** or
**CachedDbWrite** message, the site event log records the activity under
`ExternalSystem`. Every other `RaiseActivity` call in the service passes the message's
true `Category` (`EnqueueAsync`, `RetryMessageAsync`), so the operator paths are
inconsistent and produce mislabelled audit entries — misleading when an operator later
filters or reviews S&F activity by category.
**Recommendation**
Load the message (or have `StoreAndForwardStorage.RetryParkedMessageAsync` /
`DiscardParkedMessageAsync` return the affected row's category) and pass the real
`Category` to `RaiseActivity`. If loading the row is considered too costly on these
infrequent operator paths, change the `OnActivity` event / `RaiseActivity` signature to
allow a nullable category for management actions rather than asserting a false one.
**Resolution**
_Unresolved._
+122 -4
View File
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.TemplateEngine` |
| Design doc | `docs/requirements/Component-TemplateEngine.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 2 |
## Summary
@@ -29,11 +29,30 @@ create, optimistic concurrency on instance state) are claimed but not implemente
Themes: validation that is weaker than the design promises, and asymmetric handling
of attributes vs. alarms vs. scripts throughout the resolve/flatten/derive paths.
#### Re-review 2026-05-17 (commit `39d737e`)
Re-reviewed the whole module against all ten checklist categories at commit
`39d737e`. All fourteen prior findings remain closed — the batch-4 fixes
(`bc88a36`/`804697f` and predecessors) hold up: the recursive composition walk,
the per-slot alarm override mechanism, the code-region-aware delimiter scanner,
and the single-source deletion-constraint logic are all correctly in place. Two
new Medium findings surfaced, both in the composition-cascade path and both
affecting **nested** (depth ≥ 2) compositions specifically — the same blind spot
that produced TemplateEngine-001. **TemplateEngine-015**: `RenameCompositionAsync`
renames only the directly slot-owned derived template, leaving cascaded inner
derived templates with a stale dotted-path name. **TemplateEngine-016**:
`FlatteningService` hard-codes `ScriptScope.ParentPath` to the empty string for
every composed script regardless of nesting depth, so a script two or more
levels deep cannot resolve `Parent.X` references to its real parent module.
Both are limited-impact (nested compositions are the less common case and there
is design-time visibility) but represent genuine drift from the recursive-nesting
design promise.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Multiple real bugs: deep composed-member loss, derived alarms omitted, granularity bypass, no-op create-time collision block. |
| 1 | Correctness & logic bugs | ✓ | Prior bugs (001005, 013) all resolved and verified. Re-review 2026-05-17 found two new nested-composition defects: rename does not cascade (TemplateEngine-015), composed-script `ParentPath` always empty (TemplateEngine-016). |
| 2 | Akka.NET conventions | ✓ | No actors in this module (`AddTemplateEngineActors` is an empty placeholder). Nothing to assess. |
| 3 | Concurrency & thread safety | ✓ | Services are stateless, scoped per request; static helpers hold no mutable state. Design says template editing is last-write-wins; that is honoured. See TemplateEngine-010 re: a doc claim of optimistic concurrency that is not implemented. |
| 4 | Error handling & resilience | ✓ | `Result<T>` used consistently; repository nulls guarded. `FlatteningService` wraps in try/catch. No store-and-forward or failover surface in this module. |
@@ -648,3 +667,102 @@ reports all blocking reasons and uses `TemplateDeletionService`'s phrasing — t
affected `TemplateServiceTests` delete tests were updated to the unified messages,
and a regression test `DeleteTemplate_MultipleConstraints_ReportsAllNotJustFirst`
verifies all three constraint categories are surfaced together.
### TemplateEngine-015 — `RenameCompositionAsync` does not cascade-rename nested derived templates
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:680` |
**Description**
`AddCompositionAsync` builds a cascade of derived templates whose names follow a
dotted path: composing `$Sensor` (which itself composes `$Probe` as `Probe1`)
into `$Pump` as `TempSensor` produces `$Pump.TempSensor` **and** the nested
`$Pump.TempSensor.Probe1` (see `CreateCascadedCompositionAsync` and the
`AddComposition_CascadesChildCompositions` test). `RenameCompositionAsync`,
however, renames only the **directly** slot-owned derived template:
```csharp
var derived = await _repository.GetTemplateByIdAsync(composition.ComposedTemplateId, ...);
if (derived != null && derived.IsDerived && derived.OwnerCompositionId == compositionId)
{
var newDerivedName = $"{owner.Name}.{newInstanceName}";
...
derived.Name = newDerivedName;
await _repository.UpdateTemplateAsync(derived, ...);
}
```
There is no recursion into `derived.Compositions`. After renaming the `TempSensor`
slot to `MainSensor`, the parent derived becomes `$Pump.MainSensor` but the
cascaded child stays `$Pump.TempSensor.Probe1` — its name no longer reflects the
slot path it lives under, breaking the dotted-path naming invariant the cascade
otherwise maintains. `DeleteCompositionAsync` correctly recurses
(`CascadeDeleteDerivedAsync`), so rename is the asymmetric outlier. The
`RenameComposition_RenamesSlotAndDerivedTemplate` test only exercises a
single-level derived, so the gap is untested. The stale name also breaks the
`AddComposition_DerivedNameCollision_Fails` / cascade-name pre-check on any
subsequent compose that walks the now-inconsistent name tree.
**Recommendation**
Recurse over `derived.Compositions` (mirroring `CascadeDeleteDerivedAsync`),
re-deriving each cascaded child's name from the renamed parent
(`$"{parentDerivedName}.{childComposition.InstanceName}"`), and run the
existing same-name collision pre-check across every name the cascade will
produce — not just the top-level one. Add a regression test covering a
two-level cascade rename.
**Resolution**
_Unresolved._
### TemplateEngine-016 — Composed-script `ScriptScope.ParentPath` is always empty, breaking `Parent.X` resolution for nested modules
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/Flattening/FlatteningService.cs:750` |
**Description**
`ResolveComposedScriptsRecursive` assigns each composed script a `ScriptScope`:
```csharp
Scope = new Commons.Types.Scripts.ScriptScope(SelfPath: prefix, ParentPath: "")
```
`prefix` is the accumulated path-qualified module path (`Outer` at depth 1,
`Outer.Inner` at depth 2, etc.), so `SelfPath` is correct. `ParentPath`, however,
is hard-coded to the empty string at every depth. Per `ScriptScope`'s own XML
doc, `ParentPath` is "computed at flattening time and seeded into the script's
globals … so `Attributes["X"]` / `Parent.X` can prepend the right path-prefix."
For a script directly composed at depth 1 the parent is the root and `""` is
correct, but for a script in a nested module (`Outer.Inner.Foo`) the parent
module is `Outer` — yet `ParentPath` is still `""`. A nested composed script
that references `Parent.X` will therefore resolve the reference against the root
flat namespace instead of its actual parent module, reading the wrong attribute
(or failing to find one). This is the same depth-≥2 nesting blind spot as
TemplateEngine-001; the recursive walk was added there but the `Scope`
construction was not updated to carry the parent path. `ResolveComposedScripts`
for direct (root-template) scripts leaves `Scope` at the default `ScriptScope.Root`,
which is correct.
**Recommendation**
Thread the parent module path through `ResolveComposedScriptsRecursive` (the
caller already knows it — it is the `prefix` of the enclosing recursion frame,
or `""` for a depth-1 composition) and set
`ParentPath` to that value, so `SelfPath = "Outer.Inner"` pairs with
`ParentPath = "Outer"`. Add a flattening test asserting the `Scope` of a
two-level composed script.
**Resolution**
_Unresolved._