Compare commits
4 Commits
39d737ebd6
...
0ba4e49e11
| Author | SHA1 | Date | |
|---|---|---|---|
| 0ba4e49e11 | |||
| 3b3760f026 | |||
| 89636e2bbf | |||
| e49846603e |
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.CLI` |
|
||||
| Design doc | `docs/requirements/Component-CLI.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 3 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -31,8 +31,26 @@ ID-keyed, flag-based surface. Test coverage exercises `OutputFormatter`, `CliCon
|
||||
`CommandHelpers.HandleResponse`, but the HTTP client, the `debug stream` path, the JSON
|
||||
argument parsing, and the command-tree wiring are untested.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
All 13 prior findings are confirmed resolved — the resilience gaps, the dead format
|
||||
configuration, the credential-handling weakness, and the test-coverage holes have all
|
||||
been closed, and the test suite has grown substantially (`CommandTreeTests`,
|
||||
`ManagementHttpClientTests`, `DebugStreamTests`, etc.). The CLI's runtime behaviour is now
|
||||
solid. This re-review walked all 14 command groups against the full checklist and found
|
||||
three new issues, all rooted in **update-command and design-document drift** rather than
|
||||
runtime defects: every `update` command requires the entity's "core" fields (`--name`,
|
||||
`--script`) even though the design doc presents them as optional, so a partial update is
|
||||
impossible (CLI-014); the design doc's command surface has drifted again in two specific
|
||||
places — `template composition delete` and the `data-connection` config flags (CLI-015);
|
||||
and `WriteAsTable` derives table columns from only the first array element, silently
|
||||
dropping columns for any later element with a different shape (CLI-016). No
|
||||
Critical/High issues; the module remains healthy.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
_Original review (2026-05-16, `9c60592`):_
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ☑ | Format precedence is broken (CLI-001); empty/non-JSON success bodies crash table rendering (CLI-002, CLI-003). |
|
||||
@@ -46,6 +64,21 @@ argument parsing, and the command-tree wiring are untested.
|
||||
| 9 | Testing coverage | ☑ | No tests for `ManagementHttpClient`, `DebugCommands`, command-tree wiring, or JSON argument parsing (CLI-013). |
|
||||
| 10 | Documentation & comments | ☑ | `Component-CLI.md` mismatch (CLI-007); the in-repo `README.md` is reasonably accurate. Minor exit-code doc mismatch (CLI-009). |
|
||||
|
||||
_Re-review (2026-05-17, `39d737e`):_
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ☑ | Update commands require "core" fields, blocking partial updates (CLI-014); `WriteAsTable` headers derived from first array element only (CLI-016). |
|
||||
| 2 | Akka.NET conventions | ☑ | Not applicable — pure HTTP/SignalR client. No issues. |
|
||||
| 3 | Concurrency & thread safety | ☑ | `debug stream` concurrency now resolved via `DebugStreamHelpers` (CLI-011/012). No new issues. |
|
||||
| 4 | Error handling & resilience | ☑ | Malformed-URL / malformed-JSON / connect-cancellation paths all hardened (CLI-004/005/010). No new issues. |
|
||||
| 5 | Security | ☑ | Env-var credential fallback in place (CLI-006). Basic Auth over HTTP is by design. No new issues. |
|
||||
| 6 | Performance & resource management | ☑ | `CancellationTokenSource` now `using`-scoped (CLI-011). No new issues. |
|
||||
| 7 | Design-document adherence | ☑ | Two residual command-surface drifts: `template composition delete` and `data-connection --primary-config` (CLI-015). |
|
||||
| 8 | Code organization & conventions | ☑ | Consistent and clean; option construction centralised in `CliOptions`. No new issues. |
|
||||
| 9 | Testing coverage | ☑ | Substantially expanded (`CommandTreeTests`, `ManagementHttpClientTests`, `DebugStreamTests`). No new gaps. |
|
||||
| 10 | Documentation & comments | ☑ | XML docs accurate. `Component-CLI.md` drift folded into CLI-015. |
|
||||
|
||||
## Findings
|
||||
|
||||
### CLI-001 — `SCADALINK_FORMAT` env var and config-file format are dead; format precedence broken
|
||||
@@ -537,3 +570,121 @@ Resolved 2026-05-16 (commit pending). Coverage gaps confirmed and closed:
|
||||
`InstanceCommands.TryParseBindings`/`TryParseOverrides` and covered by
|
||||
`InstanceArgumentParsingTests` under CLI-005.
|
||||
The CLI test suite went from 42 to 77 passing tests.
|
||||
|
||||
### CLI-014 — `update` commands require "core" fields, making partial updates impossible
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CLI/Commands/TemplateCommands.cs:77`, `src/ScadaLink.CLI/Commands/SiteCommands.cs:86`, `src/ScadaLink.CLI/Commands/ExternalSystemCommands.cs:40-42`, `src/ScadaLink.CLI/Commands/DataConnectionCommands.cs:39-40`, `src/ScadaLink.CLI/Commands/NotificationCommands.cs:40-41`, `src/ScadaLink.CLI/Commands/ApiMethodCommands.cs:79` |
|
||||
|
||||
**Description**
|
||||
|
||||
The design doc presents `update` commands with all non-`--id` fields as optional, e.g.
|
||||
`template update --id <id> [--name <name>] [--description <desc>] [--parent-id <id>]`
|
||||
(`Component-CLI.md:62`) and `api-method update --id <id> [--script <code>] ...`
|
||||
(`Component-CLI.md:224`). The implementation contradicts this: the update commands mark
|
||||
the entity's "core" fields as `Required = true`, so the user must always re-supply them:
|
||||
|
||||
- `template update` — `--name` is `Required = true` (`TemplateCommands.cs:77`).
|
||||
- `site update` — `--name` is `Required = true` (`SiteCommands.cs:86`).
|
||||
- `external-system update` — `--name`, `--endpoint-url`, `--auth-type` are all
|
||||
`Required = true` (`ExternalSystemCommands.cs:40-42`).
|
||||
- `data-connection update` — `--name`, `--protocol` are `Required = true`
|
||||
(`DataConnectionCommands.cs:39-40`).
|
||||
- `notification update` — `--name`, `--emails` are `Required = true`
|
||||
(`NotificationCommands.cs:40-41`).
|
||||
- `api-method update` — `--script` is `Required = true` (`ApiMethodCommands.cs:79`).
|
||||
- The same pattern applies to `template attribute/alarm/script update`
|
||||
(`TemplateCommands.cs:164-165`, `246-248`, `332-334`) and `role-mapping update`
|
||||
(`SecurityCommands.cs:110-111`).
|
||||
|
||||
Because the corresponding `Update*Command` records are whole-replace (they carry the full
|
||||
field set, not a sparse patch), a user who wants to change only one field — e.g. flip an
|
||||
API method's timeout, or rename a template — must look up and re-pass every other field's
|
||||
current value. Omitting any required flag is a hard parse error. This makes scripted,
|
||||
single-field updates (a core CLI/CI use case) awkward and error-prone, and it does not
|
||||
match the documented optional-flag surface.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Decide on one model and align doc + code. Either (a) make the update flags genuinely
|
||||
optional and have the server/`Update*Command` treat a null field as "leave unchanged"
|
||||
(sparse patch), or (b) if whole-replace is intentional, update `Component-CLI.md` to show
|
||||
these flags as required (no `[...]`) and document that an update replaces the whole
|
||||
entity. Option (a) matches the documented surface and the typical CLI expectation.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CLI-015 — `Component-CLI.md` command surface has drifted again in two places
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `docs/requirements/Component-CLI.md:75`, `docs/requirements/Component-CLI.md:125-126` (vs. `src/ScadaLink.CLI/Commands/TemplateCommands.cs:404-413`, `src/ScadaLink.CLI/Commands/DataConnectionCommands.cs:41`, `:86`) |
|
||||
|
||||
**Description**
|
||||
|
||||
CLI-007 regenerated the doc's "Command Structure" section, but two specific drifts remain
|
||||
or were introduced:
|
||||
|
||||
- `Component-CLI.md:75` documents `template composition delete --template-id <id>
|
||||
--instance-name <name>`, but the implementation (`TemplateCommands.cs:404-413`) deletes
|
||||
a composition by its own integer ID via a single `--id` option
|
||||
(`DeleteTemplateCompositionCommand(id)`). The doc's two-flag form does not exist.
|
||||
- `data-connection create` and `update` accept a `--primary-config` option (aliased
|
||||
`--configuration`) for the primary configuration JSON (`DataConnectionCommands.cs:86`,
|
||||
`:41`), but `Component-CLI.md:125-126` lists only `--backup-config` and
|
||||
`--failover-retry-count` — the primary-config flag is absent from the doc.
|
||||
|
||||
A reader following the doc would use a non-existent `template composition delete` form and
|
||||
would not discover the `--primary-config` flag.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Correct `Component-CLI.md:75` to `template composition delete --id <id>`, and add
|
||||
`[--primary-config <json>]` to the documented `data-connection create`/`update` signatures
|
||||
(`Component-CLI.md:125-126`). Also note the `--configuration` alias if aliases are
|
||||
documented elsewhere.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CLI-016 — `WriteAsTable` derives columns from the first array element only
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CLI/Commands/CommandHelpers.cs:184-200` |
|
||||
|
||||
**Description**
|
||||
|
||||
When rendering a JSON array as a table, `WriteAsTable` builds the header set from
|
||||
`items[0].EnumerateObject()` only (`CommandHelpers.cs:184-186`) and then projects every
|
||||
row against that fixed header list (`:188-198`). If a later element of the array has a
|
||||
different shape — additional properties, or properties the first element lacks — those
|
||||
extra columns are silently dropped from the table and a row missing a header property
|
||||
renders an empty cell. The user sees a table that appears complete but has omitted data,
|
||||
with no indication that columns were discarded. (The JSON output path is unaffected; this
|
||||
only affects `--format table`.) Management API list responses are generally homogeneous,
|
||||
so the practical impact is low, but a heterogeneous array — e.g. a diff or a mixed-status
|
||||
list — would be rendered incorrectly with no warning.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Compute the header set as the union of property names across all array elements (iterate
|
||||
all items, collect distinct property names preserving first-seen order) before projecting
|
||||
rows, so no element's data is silently dropped.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.CentralUI` |
|
||||
| Design doc | `docs/requirements/Component-CentralUI.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 6 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -32,6 +32,24 @@ Testing coverage is thin for a module this large: only the script analyzer,
|
||||
TreeView, schema model, and a few data-connection pages have unit tests; most
|
||||
pages and the auth bridge are untested.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
All 19 findings from the 2026-05-16 review are confirmed closed. The resolution
|
||||
batch (`a9bd7ee`..`34588ae`) substantially rewrote the auth bridge, the script
|
||||
sandbox, several Deployment/Monitoring pages, and the shared component disposal
|
||||
paths, so this re-review re-examined the post-fix state across all 10 checklist
|
||||
categories. Six new findings (CentralUI-020 .. 025) were recorded. The most
|
||||
important is **CentralUI-020**: the two prior fixes interact destructively — the
|
||||
CentralUI-004 fix made `CookieAuthenticationStateProvider` return a frozen,
|
||||
constructor-time auth-state snapshot, while the CentralUI-005 fix rewrote
|
||||
`SessionExpiry.razor` to *poll* that same provider to detect a lapsed session.
|
||||
Because the snapshot never changes for the life of the circuit, the idle-timeout
|
||||
redirect can never fire, so the documented idle-logout behaviour is silently
|
||||
defeated. The remaining new findings are a cross-thread `Dictionary` mutation in
|
||||
`DebugView`, an unguarded `InvokeAsync` in the new `Deployments` push handler,
|
||||
and three Low-severity items (residual bare `catch`, magic-string claim
|
||||
lookups, and the untested `SessionExpiry` polling path).
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
@@ -47,6 +65,14 @@ pages and the auth bridge are untested.
|
||||
| 9 | Testing coverage | ☑ | Auth, sandbox-run, DebugView, Health, ParkedMessages, most pages untested. |
|
||||
| 10 | Documentation & comments | ☑ | Comments are accurate and helpful; a few stale claims noted. |
|
||||
|
||||
Re-review 2026-05-17 (`39d737e`): all 10 categories re-examined against the
|
||||
post-fix source. New findings — category 3 (CentralUI-020 the auth-snapshot vs
|
||||
session-poll interaction is also a design-adherence regression; CentralUI-021
|
||||
cross-thread `Dictionary`; CentralUI-022 unguarded `InvokeAsync`), category 4
|
||||
(CentralUI-023 residual bare `catch`), category 8 (CentralUI-024 magic-string
|
||||
claims), category 9 (CentralUI-025 untested `SessionExpiry` poll). Categories
|
||||
1, 2, 5, 6, 7, 10 produced no new findings.
|
||||
|
||||
## Findings
|
||||
|
||||
### CentralUI-001 — Test Run sandbox executes arbitrary C# with no trust-model enforcement
|
||||
@@ -961,3 +987,232 @@ it was failing on the baseline because `DeploymentService` had gained a
|
||||
`DiffService` constructor dependency from a DeploymentManager contract change
|
||||
that the test fixture had not been updated for; `DiffService` is now registered
|
||||
in the fixture.)
|
||||
|
||||
### CentralUI-020 — Idle-session redirect never fires: `SessionExpiry` polls a frozen auth-state snapshot
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CentralUI/Components/Shared/SessionExpiry.razor:39-62`; `src/ScadaLink.CentralUI/Auth/CookieAuthenticationStateProvider.cs:29-43` |
|
||||
|
||||
**Description**
|
||||
|
||||
The CentralUI-004 fix and the CentralUI-005 fix interact destructively.
|
||||
CentralUI-004 made `CookieAuthenticationStateProvider` snapshot the principal
|
||||
**once** in its constructor into a cached `Task<AuthenticationState>` and serve
|
||||
that exact task for the entire life of the SignalR circuit — it never re-reads
|
||||
`HttpContext`, never calls `SetAuthenticationState`, and never raises
|
||||
`NotifyAuthenticationStateChanged`. CentralUI-005 then rewrote
|
||||
`SessionExpiry.razor` to *poll* `AuthStateProvider.GetAuthenticationStateAsync()`
|
||||
once a minute and redirect to `/login` "once the sliding cookie has actually
|
||||
lapsed server-side." But `GetAuthenticationStateAsync()` returns the same frozen
|
||||
constructor-time snapshot on every call — `auth.User.Identity.IsAuthenticated`
|
||||
is permanently `true` for the life of the circuit regardless of whether the
|
||||
server-side cookie has expired. The poll loop therefore never observes an
|
||||
expired session and the redirect never fires. An idle user whose cookie has
|
||||
lapsed server-side keeps an authenticated-looking page open indefinitely; the
|
||||
documented "30-minute idle timeout" is silently defeated for any user who
|
||||
leaves a circuit open. (The cookie middleware would still reject the *next*
|
||||
full HTTP request / new circuit, so this is a stale-UI / missed-logout exposure
|
||||
rather than a full auth bypass — but the page continues to render
|
||||
authenticated content and a SignalR circuit can stay alive for a long time.)
|
||||
This is also a design-document-adherence regression against CLAUDE.md
|
||||
"Security & Auth" (30-minute idle timeout) — recorded under Concurrency because
|
||||
the root cause is the lifetime/staleness mismatch between the two components.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
`SessionExpiry` must consult something that actually reflects the live
|
||||
server-side session, not the circuit's frozen principal. Options: (a) have
|
||||
`SessionExpiry` poll a lightweight authenticated server endpoint (e.g. a
|
||||
`/auth/ping` minimal API that returns 401 once the cookie has lapsed) and
|
||||
redirect on 401; or (b) give `CookieAuthenticationStateProvider` a refresh path
|
||||
that re-validates the cookie and calls `SetAuthenticationState` /
|
||||
`NotifyAuthenticationStateChanged` so the polled state can actually change.
|
||||
Whichever is chosen, add a test that exercises the redirect path with an
|
||||
expired session (see CentralUI-025).
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CentralUI-021 — `DebugView` stream callback mutates `Dictionary` off the render thread
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CentralUI/Components/Pages/Deployment/DebugView.razor:404-419,511-519,275-289` |
|
||||
|
||||
**Description**
|
||||
|
||||
The `onEvent` callback passed to `DebugStreamService.StartStreamAsync` runs on
|
||||
an Akka/gRPC thread (as the design doc and the CentralUI-009 comments state). It
|
||||
calls `UpsertWithCap(_attributeValues, …)` / `UpsertWithCap(_alarmStates, …)`
|
||||
**directly on that thread** — the mutation is not marshalled through
|
||||
`InvokeAsync`; only the subsequent `StateHasChanged` is. Meanwhile the render
|
||||
thread evaluates `FilteredAttributeValues` / `FilteredAlarmStates`, which
|
||||
enumerate `_attributeValues.Values` / `_alarmStates.Values` and call
|
||||
`OrderBy(...).ToList()`. `Dictionary<TKey,TValue>` is not thread-safe: a write
|
||||
on the Akka thread concurrent with an enumeration on the render thread can throw
|
||||
`InvalidOperationException` ("Collection was modified; enumeration operation may
|
||||
not execute") or corrupt the dictionary's internal buckets. The CentralUI-009
|
||||
fix added a `_disposed` guard but did not address this data race — the guard
|
||||
only prevents touching a *disposed* component, not concurrent access to a live
|
||||
one. Under a busy debug stream this will intermittently fault the page.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Marshal the dictionary mutation onto the render thread too — move the
|
||||
`UpsertWithCap` call inside the `SafeInvokeAsync`/`InvokeAsync` body so all
|
||||
access to `_attributeValues`/`_alarmStates` happens on the renderer's
|
||||
dispatcher. Alternatively guard both the writes and the `Filtered*` reads with a
|
||||
lock, or use a concurrent collection. The cap-trim loop must be inside the same
|
||||
critical section as the upsert.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CentralUI-022 — `Deployments` push handler fires `InvokeAsync` with no disposal guard
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CentralUI/Components/Pages/Deployment/Deployments.razor:221-229,317-322` |
|
||||
|
||||
**Description**
|
||||
|
||||
`OnDeploymentStatusChanged` is invoked by `IDeploymentStatusNotifier`, a process
|
||||
singleton, on the DeploymentManager service thread. The handler does
|
||||
`_ = InvokeAsync(async () => { await LoadDataAsync(); StateHasChanged(); })`,
|
||||
discarding the returned task. `Dispose()` unsubscribes the handler, but there is
|
||||
a race window: the notifier can read the subscriber list and begin invoking
|
||||
`OnDeploymentStatusChanged` *just before* the component is disposed, so
|
||||
`InvokeAsync` then runs against a disposed component and throws
|
||||
`ObjectDisposedException` on the DeploymentManager thread — an unobserved task
|
||||
exception (the task is fire-and-forget). The same hazard was explicitly fixed
|
||||
for `DebugView` (CentralUI-009, `SafeInvokeAsync` + `_disposed` flag) and
|
||||
`ToastNotification` (CentralUI-010), but the new push-based `Deployments`
|
||||
handler introduced by the CentralUI-006 fix did not adopt the same guard.
|
||||
Separately, every push event triggers two full repository reloads
|
||||
(`GetAllInstancesAsync` + `GetAllDeploymentRecordsAsync`) for every open
|
||||
circuit, so a burst of status changes amplifies into N×2 round-trips per tick.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add a `volatile bool _disposed` set first in `Dispose()`, have
|
||||
`OnDeploymentStatusChanged` no-op when set, and wrap the `InvokeAsync` dispatch
|
||||
to swallow `ObjectDisposedException` (mirror `DebugView.SafeInvokeAsync`).
|
||||
Optionally coalesce bursts (debounce) and/or reload only the changed record
|
||||
rather than the whole table on each event.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CentralUI-023 — Residual bare `catch {}` blocks swallow JS interop errors
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CentralUI/Components/Pages/Monitoring/ParkedMessages.razor:690-698`; `src/ScadaLink.CentralUI/Components/Shared/DiffDialog.razor:107-116,118-130,104` |
|
||||
|
||||
**Description**
|
||||
|
||||
CentralUI-018 narrowed the bare `catch {}` blocks in `MonacoEditor`,
|
||||
`TreeView`, and `Sites.razor`, but the same pattern survives elsewhere.
|
||||
`ParkedMessages.CopyAsync` wraps `navigator.clipboard.writeText` in
|
||||
`catch { _toast.ShowError("Copy failed."); }` — a real `JSException`
|
||||
(clipboard permission denied) and an expected `JSDisconnectedException` are
|
||||
treated identically and neither is logged. `DiffDialog.TryLockBodyAsync` /
|
||||
`TryUnlockBodyAsync` each have a bare outer `catch` whose handler does another
|
||||
JS call wrapped in a second bare `catch { /* swallow */ }`, and
|
||||
`OnAfterRenderAsync`'s `_modalRef.FocusAsync()` is wrapped in a bare
|
||||
`catch { /* prerender or detached: ignore */ }`. Genuine interop failures in
|
||||
these paths are invisible in production logs.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Catch `JSDisconnectedException` silently and `JSException` (and
|
||||
`InvalidOperationException` for the prerender focus case) with an `ILogger`
|
||||
call, consistent with the CentralUI-018 fixes in the same module.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CentralUI-024 — Claim lookups use magic strings instead of `JwtTokenService` constants
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Code organization & conventions |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CentralUI/Components/Layout/NavMenu.razor:102`; `src/ScadaLink.CentralUI/Components/Pages/Dashboard.razor:14`; `GetCurrentUserAsync` in `Templates.razor`, `TemplateEdit.razor`, `TemplateCreate.razor`, `SharedScripts.razor`, `SharedScriptForm.razor`, `Sites.razor`, `Topology.razor`, `InstanceCreate.razor`, `InstanceConfigure.razor` |
|
||||
|
||||
**Description**
|
||||
|
||||
`ScadaLink.Security.JwtTokenService` exposes the canonical claim-type constants
|
||||
(`UsernameClaimType = "Username"`, `DisplayNameClaimType = "DisplayName"`,
|
||||
`RoleClaimType`, `SiteIdClaimType`). `SiteScopeService` correctly uses
|
||||
`JwtTokenService.SiteIdClaimType`, but every `GetCurrentUserAsync` helper across
|
||||
ten pages does `authState.User.FindFirst("Username")?.Value`, and `NavMenu` /
|
||||
`Dashboard` do `context.User.FindFirst("DisplayName")`. The literals happen to
|
||||
match the constants today, so there is no live bug — but if the claim type is
|
||||
ever renamed in `JwtTokenService` (the single source of truth) every one of
|
||||
these call sites silently breaks, falling back to `"unknown"` for the audit
|
||||
user and a blank display name. The duplicated `GetCurrentUserAsync` helper is
|
||||
also copy-pasted verbatim into ten components.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Replace the string literals with `JwtTokenService.UsernameClaimType` /
|
||||
`DisplayNameClaimType`. Consider extracting the repeated `GetCurrentUserAsync`
|
||||
into a single shared helper (e.g. an extension on `AuthenticationStateProvider`
|
||||
or a small scoped service) so the claim lookup lives in exactly one place.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CentralUI-025 — `SessionExpiry` polling/redirect path has no test coverage
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Testing coverage |
|
||||
| Status | Open |
|
||||
| Location | `tests/ScadaLink.CentralUI.Tests/Auth/SessionExpiryPolicyTests.cs`; `src/ScadaLink.CentralUI/Components/Shared/SessionExpiry.razor` |
|
||||
|
||||
**Description**
|
||||
|
||||
`SessionExpiryPolicyTests` covers only `AuthEndpoints.BuildSignInProperties()`
|
||||
(the sign-in properties shape). The actual runtime behaviour of
|
||||
`SessionExpiry.razor` — that an expired session triggers a redirect to
|
||||
`/login`, that an authenticated session does not, and that the component does
|
||||
not poll/redirect on the `/login` page itself — is untested. Had a behavioural
|
||||
test exercised the redirect with an expired/anonymous auth state against the
|
||||
real `CookieAuthenticationStateProvider`, the CentralUI-020 defect (the frozen
|
||||
snapshot never reporting an expired session) would have been caught. The
|
||||
component is the system's only client-side idle-logout mechanism, so the gap is
|
||||
material.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add bUnit tests for `SessionExpiry`: (a) with an unauthenticated auth state the
|
||||
component navigates to `/login`; (b) with an authenticated state it does not;
|
||||
(c) on the `/login` route it neither polls nor redirects. The provider used in
|
||||
the test must be one whose state can actually transition to expired — which
|
||||
also forces the CentralUI-020 fix.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.ClusterInfrastructure` |
|
||||
| Design doc | `docs/requirements/Component-ClusterInfrastructure.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 2 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -29,20 +29,39 @@ every other component runs on, yet it presently delivers nothing the design requ
|
||||
The single options class is clean and its test covers defaults and setters
|
||||
adequately for what exists.
|
||||
|
||||
**Re-review 2026-05-17 (commit `39d737e`).** All eight prior findings (CI-001..008)
|
||||
were resolved by the batch of work between `9c60592` and `39d737e`: `ClusterOptions`
|
||||
gained XML docs, the `SectionName` constant, and the `DownIfAlone` property;
|
||||
`ClusterOptionsValidator` was added; `ServiceCollectionExtensions` now registers the
|
||||
validator and throws from the dead actor-registration method; and the test project
|
||||
grew to 16 cases across three test classes. The module is in good shape — the
|
||||
`ClusterOptions` contract, its validator, and the DI registration are all sound,
|
||||
well-documented, and well-tested. This re-review examined all three source files and
|
||||
all three test files against the full 10-category checklist and found **two new
|
||||
issues**, both stemming from work the prior review explicitly deferred to a "Host
|
||||
review" that has not happened: the `DownIfAlone` property is exposed and validated as
|
||||
part of the configuration contract but is never consumed — `ScadaLink.Host`'s
|
||||
`BuildHocon` still hard-codes `down-if-alone = on` (CI-009, Medium) — and the validator
|
||||
does not enforce the design doc's requirement that `down-if-alone` be `on` for the
|
||||
keep-oldest resolver, so `DownIfAlone = false` is silently accepted (CI-010, Low).
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
Original review (2026-05-16, `9c60592`) below; the re-review notes (2026-05-17,
|
||||
`39d737e`) are appended in each row.
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ✓ | No executable logic exists beyond an options POCO; no logic bugs, but `ServiceCollectionExtensions` returns success while doing nothing (CI-002). |
|
||||
| 2 | Akka.NET conventions | ✓ | No actors, no `ActorSystem` bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against `Tell`/`Ask`, immutability, or `PipeTo`. |
|
||||
| 3 | Concurrency & thread safety | ✓ | No shared mutable state, no actors, no async code. No issues found in current code. |
|
||||
| 4 | Error handling & resilience | ✓ | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. |
|
||||
| 5 | Security | ✓ | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. |
|
||||
| 6 | Performance & resource management | ✓ | No streams, connections, timers, or `IDisposable` resources exist yet. No issues found in current code. |
|
||||
| 7 | Design-document adherence | ✓ | Severe drift: the module implements none of its documented responsibilities (CI-001). `ClusterOptions` also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and `down-if-alone` (CI-003). |
|
||||
| 8 | Code organization & conventions | ✓ | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing `IValidateOptions`/data-annotation validation (CI-004) versus the Options pattern intent. |
|
||||
| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). |
|
||||
| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). |
|
||||
| 1 | Correctness & logic bugs | ✓ | No executable logic exists beyond an options POCO; no logic bugs, but `ServiceCollectionExtensions` returns success while doing nothing (CI-002). **Re-review:** CI-002 resolved. New — `DownIfAlone` is a settable property that controls nothing because the HOCON builder hard-codes the value (CI-009). |
|
||||
| 2 | Akka.NET conventions | ✓ | No actors, no `ActorSystem` bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against `Tell`/`Ask`, immutability, or `PipeTo`. **Re-review:** confirmed the Akka bootstrap legitimately lives in `ScadaLink.Host` (CI-001 resolution); still nothing actor-related in this module. No issues. |
|
||||
| 3 | Concurrency & thread safety | ✓ | No shared mutable state, no actors, no async code. No issues found in current code. **Re-review:** validator and DI extensions are stateless; no issues. |
|
||||
| 4 | Error handling & resilience | ✓ | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. **Re-review:** the validator now fails fast on misconfiguration. New — it does not enforce the design doc's `down-if-alone = on` requirement (CI-010). |
|
||||
| 5 | Security | ✓ | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. **Re-review:** still no authn/authz surface, no secret handling. No issues. |
|
||||
| 6 | Performance & resource management | ✓ | No streams, connections, timers, or `IDisposable` resources exist yet. No issues found in current code. **Re-review:** no resources held; the validator allocates a small failure list per call only. No issues. |
|
||||
| 7 | Design-document adherence | ✓ | Severe drift: the module implements none of its documented responsibilities (CI-001). `ClusterOptions` also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and `down-if-alone` (CI-003). **Re-review:** CI-001/CI-003 resolved (ownership split documented; `DownIfAlone` added). New — `DownIfAlone` was added to the contract but never wired into the HOCON (CI-009). |
|
||||
| 8 | Code organization & conventions | ✓ | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing `IValidateOptions`/data-annotation validation (CI-004) versus the Options pattern intent. **Re-review:** CI-004/CI-005 resolved; `SectionName` constant present and options/validator placement correct. No issues. |
|
||||
| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). **Re-review:** CI-006 resolved — 16 tests across three classes covering options, validator, and DI registration. No `DownIfAlone`-wiring test exists, but that wiring lives in the Host (CI-009). No new issue here. |
|
||||
| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). **Re-review:** CI-007/CI-008 resolved — full XML docs on all members; skeleton comments gone. Note: the `DownIfAlone` XML doc calls `true` "the design-doc requirement" yet the value is inert (CI-009) and unenforced (CI-010). |
|
||||
|
||||
## Findings
|
||||
|
||||
@@ -490,3 +509,97 @@ section, and the component table reflects the true placement. This is a
|
||||
documentation-only finding, so no runtime regression test is meaningful; verified by
|
||||
inspection of `ServiceCollectionExtensions.cs` and
|
||||
`docs/requirements/Component-ClusterInfrastructure.md:21-39`.
|
||||
|
||||
### ClusterInfrastructure-009 — `DownIfAlone` is an inert configuration knob — never consumed by the HOCON builder
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:74` |
|
||||
|
||||
**Description**
|
||||
|
||||
The `DownIfAlone` property was added to `ClusterOptions` by CI-003's resolution as
|
||||
part of "the split-brain configuration contract". It is public, defaults to `true`,
|
||||
carries an XML doc presenting it as "the design-doc requirement", and is exercised by
|
||||
`ClusterOptionsTests.DownIfAlone_CanBeSet`. However, nothing in the system reads it.
|
||||
The Akka.NET HOCON is generated by `ScadaLink.Host.Actors.AkkaHostedService.BuildHocon`,
|
||||
which **hard-codes** the resolver setting:
|
||||
|
||||
```
|
||||
split-brain-resolver {
|
||||
active-strategy = ...
|
||||
stable-after = ...
|
||||
keep-oldest {
|
||||
down-if-alone = on
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
`BuildHocon` receives the full `ClusterOptions` instance and consumes every other
|
||||
field (`SeedNodes`, `MinNrOfMembers`, `SplitBrainResolverStrategy`, `StableAfter`,
|
||||
`HeartbeatInterval`, `FailureDetectionThreshold`) but ignores `DownIfAlone` entirely.
|
||||
The result is a configuration property that an operator can set in `appsettings.json`,
|
||||
that passes validation, and that has **zero runtime effect** — setting
|
||||
`DownIfAlone: false` does not turn the flag off. CI-003's resolution explicitly
|
||||
acknowledged this gap ("wiring it to read `DownIfAlone` is a one-line `ScadaLink.Host`
|
||||
change ... noted for the Host's review") but the wiring was never done and no tracked
|
||||
finding carried it, so the gap has silently persisted to commit `39d737e`. An inert,
|
||||
misleadingly-documented configuration knob is a correctness and design-adherence
|
||||
defect: it gives operators a false sense of control over a safety-critical resolver
|
||||
setting.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either (a) wire `DownIfAlone` into `BuildHocon` — emit `down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}`
|
||||
— so the property does what its XML doc claims (a Host-side change, to be tracked in
|
||||
the Host module's review since `BuildHocon` lives there), or (b) if the flag is
|
||||
intentionally fixed at `on` and must never be operator-configurable, remove the
|
||||
`DownIfAlone` property from `ClusterOptions` and document the hard-coded `on` value as
|
||||
a non-negotiable invariant. Do not leave a public, settable, validated property that
|
||||
controls nothing.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### ClusterInfrastructure-010 — Validator does not enforce `DownIfAlone = true` despite the design doc requiring it
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptionsValidator.cs:21-71` |
|
||||
|
||||
**Description**
|
||||
|
||||
`Component-ClusterInfrastructure.md` (Split-Brain Resolution) states the keep-oldest
|
||||
resolver must be configured with `down-if-alone = on`, and the XML doc on
|
||||
`ClusterOptions.DownIfAlone` calls `true` "the design-doc requirement" — the rationale
|
||||
being that without it the oldest node can run as an isolated single-node cluster
|
||||
during a partition while the younger node forms its own. `ClusterOptionsValidator`
|
||||
guards every other safety-critical setting (`MinNrOfMembers == 1`, `keep-oldest`-only
|
||||
strategy, positive timings, heartbeat below the failure threshold) but performs no
|
||||
check on `DownIfAlone`. A configuration of `DownIfAlone: false` therefore passes
|
||||
validation cleanly. This is currently latent because CI-009 shows the property is not
|
||||
consumed at all, but the moment CI-009 is fixed by wiring the property into the HOCON
|
||||
(option (a)), `DownIfAlone: false` would silently produce the unsafe single-node
|
||||
behaviour the design doc explicitly forbids — with no fail-fast guard. The validator
|
||||
is the right place to enforce the invariant, consistent with how it already rejects
|
||||
quorum split-brain strategies.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
If CI-009 is resolved by keeping `DownIfAlone` configurable, add a check to
|
||||
`ClusterOptionsValidator.Validate` that fails when `DownIfAlone` is `false` (or, if
|
||||
some future deployment legitimately needs it off, fails only in combination with the
|
||||
`keep-oldest` strategy), with a message explaining the isolated-single-node-cluster
|
||||
hazard. If CI-009 is resolved by removing the property, this finding is moot and
|
||||
should be closed as resolved alongside it.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.Commons` |
|
||||
| Design doc | `docs/requirements/Component-Commons.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 2 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -32,14 +32,28 @@ kind of edge-case logic that warrants them. Entity and message contracts otherwi
|
||||
clean and additive-evolution-friendly, with the exception of one `ValueTuple` use in a
|
||||
wire command.
|
||||
|
||||
**Re-review 2026-05-17 (commit `39d737e`).** All twelve prior findings (Commons-001
|
||||
through Commons-012) are confirmed `Resolved` — the fixes are sound, well-targeted, and
|
||||
backed by focused regression tests (`StaleTagMonitorRaceTests`, `DynamicJsonElementTests`,
|
||||
`ScriptParametersTests`, `ManagementCommandRegistryTests`, `OpcUaEndpointConfigSerializerTests`,
|
||||
`ResultTests`, `ValueFormatterTests`, `ConnectionBindingSerializationTests`,
|
||||
`FlatteningAndScriptScopeTests`). The new files introduced since `9c60592`
|
||||
(`TemplateAlarm` lock/inherit fields, `IExternalSystemRepository` name-keyed lookups,
|
||||
`DeploymentStateQueryRequest`/`Response`, `ParameterDefinition`) follow the established
|
||||
POCO / record / additive-evolution conventions and carry round-trip compatibility tests.
|
||||
Two new Low-severity findings were recorded this pass: a `DynamicJsonElement` array
|
||||
indexer that rejects `long` indices (Commons-013) and an `OpcUaEndpointConfigSerializer`
|
||||
legacy-fallback path that can mislabel a corrupt new-shape row as `Legacy` (Commons-014).
|
||||
No Critical, High, or Medium issues were found.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ✓ | `DynamicJsonElement.TryConvert` returns success for non-convertible types; `Result<T>` allows null error; legacy-config fallback loses data. |
|
||||
| 1 | Correctness & logic bugs | ✓ | `DynamicJsonElement.TryConvert` returns success for non-convertible types; `Result<T>` allows null error; legacy-config fallback loses data. Re-review: `DynamicJsonElement.TryGetIndex` rejects non-`int` indices (Commons-013). |
|
||||
| 2 | Akka.NET conventions | ✓ | Commons has no actors (correct). Message contracts are records and immutable. One wire message uses `ValueTuple` (Commons-008). Correlation IDs present on request/response messages. |
|
||||
| 3 | Concurrency & thread safety | ✓ | `StaleTagMonitor` has a check-then-act race between the timer callback and `OnValueReceived` (Commons-001). |
|
||||
| 4 | Error handling & resilience | ✓ | `ScriptParameters.GetNullable` silently swallows conversion failures (Commons-003); OPC UA legacy deserialize discards malformed input (Commons-005). |
|
||||
| 4 | Error handling & resilience | ✓ | `ScriptParameters.GetNullable` silently swallows conversion failures (Commons-003); OPC UA legacy deserialize discards malformed input (Commons-005). Re-review: corrupt typed OPC UA rows can fall through to the legacy path and be mislabelled `Legacy` (Commons-014). |
|
||||
| 5 | Security | ✓ | No auth logic here. `SmtpConfiguration.Credentials` / OPC UA passwords are plain-string fields (storage/encryption is a consumer concern) — noted, not a finding. No script-trust violations: Commons defines no forbidden-API surface. |
|
||||
| 6 | Performance & resource management | ✓ | `StaleTagMonitor` disposes its `Timer` correctly. `DynamicJsonElement` references a `JsonElement` whose backing document lifetime is not owned (Commons-002). |
|
||||
| 7 | Design-document adherence | ✓ | Several behavior-bearing helper/validator/serializer classes push against REQ-COM-6 "no business logic" (Commons-007). Folder layout matches REQ-COM-5b. |
|
||||
@@ -566,3 +580,75 @@ the parameterless `ToString()`). The XML doc gained a remarks block stating the
|
||||
culture-invariant contract and why. Regression tests added in `ValueFormatterTests`
|
||||
(`FormatDisplayValue_Double_UsesInvariantCulture_*`, `_DateTime_*`, `_CollectionOfDoubles_*`,
|
||||
each pinned under `de-DE`).
|
||||
|
||||
### Commons-013 — `DynamicJsonElement.TryGetIndex` rejects non-`int` index values
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Commons/Types/DynamicJsonElement.cs:40-54` |
|
||||
|
||||
**Description**
|
||||
|
||||
`TryGetIndex` accepts an index only when `indexes[0] is int index`. `DynamicJsonElement`
|
||||
is designed for dynamic access from scripts (`obj.items[0]`). In a `dynamic` expression the
|
||||
index operand's runtime type follows the script's variable type — a script that computes
|
||||
an index in a loop counter or reads it from another `DynamicJsonElement` (whose numbers
|
||||
are unwrapped as `long` by `Wrap`, see `:105`) will pass a `long`, not an `int`. The
|
||||
pattern match then fails, `TryGetIndex` returns `false`, and the dynamic binder throws a
|
||||
`RuntimeBinderException` for what is a perfectly valid in-range index. Because the wrapper
|
||||
itself surfaces JSON numbers as `long`, `obj.items[obj.count - 1]` — count being a wrapped
|
||||
JSON number — is the exact failing case. The `int`-only guard also silently rejects
|
||||
`byte`/`short` indices that would widen to a valid array position.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Accept any integral index by converting through `Convert.ToInt64` (guarded for
|
||||
`OverflowException`) or by matching `int`, `long`, `short`, `byte` and normalizing to a
|
||||
single integer before the bounds check. Add a regression test indexing with a `long`.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Commons-014 — `OpcUaEndpointConfigSerializer.Deserialize` can mislabel a corrupt typed row as `Legacy`
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Commons/Serialization/OpcUaEndpointConfigSerializer.cs:107-131` |
|
||||
|
||||
**Description**
|
||||
|
||||
`Deserialize` tries the typed path first: it parses the document, checks for an
|
||||
`endpointUrl` property, then calls `JsonSerializer.Deserialize<OpcUaEndpointConfig>`.
|
||||
The whole block is wrapped in `catch (JsonException) { /* fall through to legacy */ }`.
|
||||
If a row *is* the current typed shape (it has `endpointUrl`) but is corrupt in a way that
|
||||
makes `JsonSerializer.Deserialize` throw a `JsonException` — e.g. an enum-valued field
|
||||
holding an unrecognised string, or a numeric field holding a non-numeric token — the
|
||||
exception is swallowed and control falls through to `LoadLegacy`. `LoadLegacy` only
|
||||
requires the root to be a JSON object, so it will usually succeed against the same input
|
||||
and the result is reported as `OpcUaConfigParseStatus.Legacy`. The Commons-005 fix added
|
||||
the `Malformed` status precisely so a caller can tell a recoverable legacy row from
|
||||
unparseable data; this path re-introduces a softer version of the same confusion — a
|
||||
genuinely broken current-shape row is presented to the user as a benign "please re-save"
|
||||
legacy row, and the offending field is silently dropped by `FromFlatDict` (which ignores
|
||||
keys it cannot parse) rather than surfaced. The XML doc describes the legacy fallback as
|
||||
being for "pre-refactor rows" only and does not mention this branch.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Only fall through to `LoadLegacy` when the typed shape is genuinely *not present* — i.e.
|
||||
the `endpointUrl` property is absent. When `endpointUrl` *is* present but typed
|
||||
deserialization throws, classify the outcome as `Malformed` (or a distinct status) so the
|
||||
caller can surface a real error instead of an empty/partial config. Tighten the XML doc
|
||||
to describe this branch, and add a regression test for a typed row with an invalid enum
|
||||
field.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.Communication` |
|
||||
| Design doc | `docs/requirements/Component-Communication.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 4 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -25,20 +25,37 @@ CLAUDE.md "Resume for coordinator actors" decision. Design-doc adherence is othe
|
||||
good. Test coverage is broad for happy paths but has gaps around failover, cache
|
||||
mutation races, and the snapshot-timeout cleanup path.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
All prior findings (Communication-001..011) are confirmed `Resolved` in this commit
|
||||
and the fixes hold up against the source. The re-review walked all 10 checklist
|
||||
categories again and uncovered a previously-missed defect at the centre of the gRPC
|
||||
node-failover path: **`SiteStreamGrpcClientFactory.GetOrCreate` caches one client per
|
||||
site identifier and silently ignores the `grpcEndpoint` argument on a cache hit**. The
|
||||
`DebugStreamBridgeActor` reconnect logic flips `_useNodeA` and passes the *other*
|
||||
node's endpoint, but the factory hands back the original NodeA-bound client every
|
||||
time — so the documented "try the other site node endpoint" failover never actually
|
||||
moves to NodeB (Communication-012). The same caching defect means a site address
|
||||
change is never picked up because `RemoveSiteAsync` has no production caller
|
||||
(Communication-013). Two Low findings round out the re-review: an untrusted
|
||||
gRPC-supplied `correlation_id` flows straight into an Akka actor name
|
||||
(Communication-014), and the factory's endpoint-reuse defect is masked by the test
|
||||
mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ✓ | Snapshot-timeout orphan, reconnect not calling `CleanupGrpc`, subscription-map races. |
|
||||
| 2 | Akka.NET conventions | ✓ | No supervision strategy on coordinators; `Sender` captured in async-launched closure path. |
|
||||
| 3 | Concurrency & thread safety | ✓ | `SiteStreamGrpcClient._subscriptions` overwrite/remove race; `_siteClients` field reassignment unused but non-readonly. |
|
||||
| 4 | Error handling & resilience | ✓ | gRPC reconnect leaks server-side relay; `LoadSiteAddressesFromDb` swallows DB failures silently. |
|
||||
| 5 | Security | ✓ | No findings in module code. DebugStreamHub auth lives outside this module (Central UI). |
|
||||
| 6 | Performance & resource management | ✓ | Orphaned subscriptions/CTS leaks; `SiteStreamGrpcClientFactory.Dispose` blocks on async. |
|
||||
| 7 | Design-document adherence | ✓ | `GrpcMaxStreamLifetime` / keepalive options defined but never applied; hard-coded values used instead. |
|
||||
| 8 | Code organization & conventions | ✓ | Options pattern correct; minor: public records declared in actor files. No structural issues. |
|
||||
| 9 | Testing coverage | ✓ | No tests for snapshot-timeout cleanup, address-cache refresh races, or gRPC server reconnect-leak. |
|
||||
| 10 | Documentation & comments | ✓ | XML comment on `DebugStreamBridgeActor` says "Persistent actor" — it is not an Akka.Persistence actor. |
|
||||
| 1 | Correctness & logic bugs | ✓ | Re-review: factory ignores endpoint on cache hit, defeating NodeA→NodeB stream failover (Communication-012). Prior items resolved. |
|
||||
| 2 | Akka.NET conventions | ✓ | Coordinator `Resume` strategies now present and verified. No new issues. |
|
||||
| 3 | Concurrency & thread safety | ✓ | Subscription-map register/remove now ownership-checked. `_siteClients` readonly. No new issues. |
|
||||
| 4 | Error handling & resilience | ✓ | `Status.Failure` handler added; reconnect unsubscribes prior stream. No new issues. |
|
||||
| 5 | Security | ✓ | Re-review: public gRPC `correlation_id` flows unvalidated into an Akka actor name (Communication-014). |
|
||||
| 6 | Performance & resource management | ✓ | Synchronous `Dispose` paths fixed; CTS leaks resolved. No new issues. |
|
||||
| 7 | Design-document adherence | ✓ | Re-review: site gRPC address-change disposal not wired — `RemoveSiteAsync` is dead code (Communication-013). gRPC options now applied. |
|
||||
| 8 | Code organization & conventions | ✓ | Options pattern correct; public records still declared in actor files (acceptable). No structural issues. |
|
||||
| 9 | Testing coverage | ✓ | Re-review: prior gaps closed, but the factory mock masks the endpoint-reuse defect — no real node-flip coverage (Communication-015). |
|
||||
| 10 | Documentation & comments | ✓ | `DebugStreamBridgeActor` summary corrected. No new issues. |
|
||||
|
||||
## Findings
|
||||
|
||||
@@ -519,3 +536,151 @@ passes after):
|
||||
(added with this finding's resolution).
|
||||
The full module suite (`dotnet test tests/ScadaLink.Communication.Tests`) is green at
|
||||
111 passing tests.
|
||||
|
||||
### Communication-012 — gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:39`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:166` |
|
||||
|
||||
**Description**
|
||||
|
||||
`SiteStreamGrpcClientFactory.GetOrCreate` is `_clients.GetOrAdd(siteIdentifier, _ =>
|
||||
CreateClient(grpcEndpoint))` — it keys the cache by **site identifier only** and the
|
||||
`grpcEndpoint` argument is used *exclusively* for the first-ever creation. Every
|
||||
subsequent call for that site returns the originally-cached `SiteStreamGrpcClient`,
|
||||
which is permanently bound to the `GrpcChannel` of whatever endpoint was passed first.
|
||||
|
||||
`DebugStreamBridgeActor` relies on the opposite behaviour. On a gRPC stream error,
|
||||
`HandleGrpcError` flips `_useNodeA` and `OpenGrpcStream` recomputes
|
||||
`endpoint = _useNodeA ? _grpcNodeAAddress : _grpcNodeBAddress`, then calls
|
||||
`_grpcFactory.GetOrCreate(_siteIdentifier, endpoint)` expecting a client connected to
|
||||
the *other* node. Because the factory ignores the new endpoint, the bridge actor
|
||||
reconnects to the **same failed NodeA endpoint** on every retry. The design doc's
|
||||
core debug-stream failover behaviour ("tries the other site node endpoint", "NodeB if
|
||||
NodeA failed, or vice versa") is therefore inoperative — when a site node goes down,
|
||||
the debug stream cannot move to the surviving node and simply exhausts `MaxRetries`
|
||||
against the dead endpoint and terminates. The `_useNodeA` flip, the `previousEndpoint`
|
||||
computation in `HandleGrpcError`, and the `CleanupGrpc` endpoint selection are all
|
||||
dead logic. (Communication-002's `Unsubscribe`-before-reconnect fix still functions,
|
||||
but it unsubscribes and re-subscribes on the *same* client/node rather than the
|
||||
intended other node.)
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Make the per-site client aware of both endpoints, or key the cache by
|
||||
`(siteIdentifier, endpoint)`, or have `GetOrCreate` detect an endpoint change and
|
||||
dispose+recreate the cached client. Given the design intent ("Falls back to NodeB if
|
||||
NodeA connection fails"), the cleanest fix is to give `SiteStreamGrpcClient` (or a
|
||||
per-site holder) both NodeA/NodeB addresses and let it switch channels internally,
|
||||
removing the endpoint argument from `GetOrCreate` entirely. Add a test that drives a
|
||||
real `SiteStreamGrpcClientFactory` through a node flip and asserts the second client
|
||||
targets the other endpoint.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Communication-013 — Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:58` |
|
||||
|
||||
**Description**
|
||||
|
||||
The design doc states that `SiteStreamGrpcClientFactory` "Disposes clients on site
|
||||
removal or address change." `RemoveSiteAsync` implements the disposal mechanism, but
|
||||
a repo-wide search finds **no production caller** — only tests invoke it. Combined
|
||||
with the cache-by-site-identifier behaviour (Communication-012), the consequence is
|
||||
that once a site's `SiteStreamGrpcClient` is created, a later edit to that site's
|
||||
`GrpcNodeAAddress` / `GrpcNodeBAddress` (via the Central UI or CLI) is never reflected
|
||||
in the cached client — it keeps using the stale channel for the life of the process.
|
||||
`CentralCommunicationActor` already refreshes the *Akka* address cache every 60s and
|
||||
recreates ClusterClients on change, but there is no equivalent invalidation path
|
||||
wired into the gRPC client factory. A site whose gRPC endpoints are corrected after
|
||||
an initial misconfiguration will never have working debug streaming until the central
|
||||
node is restarted.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Wire a site-removal / address-change signal into `SiteStreamGrpcClientFactory` —
|
||||
e.g. have `CentralCommunicationActor` (which already detects address changes in
|
||||
`HandleSiteAddressCacheLoaded`) call `RemoveSiteAsync` for sites whose gRPC addresses
|
||||
changed or were removed, or fold the gRPC endpoints into the same refresh cycle. If
|
||||
the on-the-fly address-change requirement is intentionally dropped, remove
|
||||
`RemoveSiteAsync` and correct the design doc.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Communication-014 — Untrusted gRPC `correlation_id` flows directly into an Akka actor name
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Security |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:124` |
|
||||
|
||||
**Description**
|
||||
|
||||
`SubscribeInstance` is a public gRPC endpoint hosted on each site node. It creates the
|
||||
relay actor with `$"stream-relay-{request.CorrelationId}-{actorSeq}"` as the actor
|
||||
name, where `request.CorrelationId` comes straight off the wire. Akka actor names have
|
||||
a restricted character set; a `correlation_id` containing `/`, whitespace, or other
|
||||
disallowed characters makes `ActorSystem.ActorOf` throw `InvalidActorNameException`.
|
||||
That exception is not caught inside `SubscribeInstance`, so it escapes as an unhandled
|
||||
RPC fault (and after the `_streamSubscriber.Subscribe` / `_activeStreams` entry has
|
||||
already been set up for the duration, though the `finally` does not run because the
|
||||
throw is before the `try`). In practice central always supplies a GUID, so impact is
|
||||
low, but the server is trusting client-supplied input to be actor-name-safe.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Validate `request.CorrelationId` on entry (non-empty, matches an expected GUID/safe
|
||||
pattern) and reject with `StatusCode.InvalidArgument` otherwise; or derive the actor
|
||||
name solely from the internal `_actorCounter` and keep the correlation ID only as
|
||||
actor state / dictionary key.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Communication-015 — No test exercises the real gRPC client factory across a node flip
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Testing coverage |
|
||||
| Status | Open |
|
||||
| Location | `tests/ScadaLink.Communication.Tests/Grpc/DebugStreamBridgeActorTests.cs:401`, `tests/ScadaLink.Communication.Tests/Grpc/SiteStreamGrpcClientFactoryTests.cs` |
|
||||
|
||||
**Description**
|
||||
|
||||
`DebugStreamBridgeActorTests` exercises the reconnect/failover paths through
|
||||
`MockSiteStreamGrpcClientFactory`, which returns one fixed mock client regardless of
|
||||
the `grpcEndpoint` argument. This is exactly the behaviour the *real*
|
||||
`SiteStreamGrpcClientFactory` exhibits incorrectly (Communication-012), so the mock
|
||||
masks the defect: `On_GrpcError_Reconnects_To_Other_Node` passes even though the real
|
||||
factory never reaches the other node. `SiteStreamGrpcClientFactoryTests` only asserts
|
||||
`GetOrCreate` returns the same client for the same site — it never checks what happens
|
||||
when the same site is requested with a *different* endpoint.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add a `SiteStreamGrpcClientFactoryTests` case that calls `GetOrCreate(site, endpointA)`
|
||||
then `GetOrCreate(site, endpointB)` and asserts the second call targets `endpointB`
|
||||
(it should fail today and pass after Communication-012 is fixed). Have the bridge-actor
|
||||
test's mock factory track the endpoint per call so node-flip coverage is meaningful.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.ConfigurationDatabase` |
|
||||
| Design doc | `docs/requirements/Component-ConfigurationDatabase.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 3 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -37,6 +37,28 @@ repositories (`TemplateEngineRepository`, `DeploymentManagerRepository`,
|
||||
`ExternalSystemRepository`, `InboundApiRepository`, `NotificationRepository`,
|
||||
`SiteRepository`, `InstanceLocator`) have little or no direct coverage.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
Re-reviewed the module at commit `39d737e`. All eleven findings from the initial
|
||||
review (`9c60592`) remain `Resolved` — the secret-column encryption
|
||||
(`EncryptedStringConverter` + `ApplySecretColumnEncryption`), the fail-fast no-arg
|
||||
DI overload, the `AsSplitQuery` conversions, the audit cycle-safe serializer, and the
|
||||
added repository test coverage all verified present and consistent with their
|
||||
resolutions. Three new findings were recorded. The most material is that the
|
||||
encryption work done for CD-004 left one bearer credential out of scope:
|
||||
`ApiKey.KeyValue` — the inbound-API authentication secret — is still persisted in
|
||||
plaintext (`ConfigurationDatabase-012`); it cannot use the same non-deterministic
|
||||
Data Protection converter because authentication looks the key up *by value*, so it
|
||||
needs a hash-based scheme instead. The second is a resilience gap in the encryption
|
||||
plumbing itself: `ApplySecretColumnEncryption` silently substitutes a throwaway
|
||||
`EphemeralDataProtectionProvider` whenever no provider is supplied, so any context
|
||||
constructed via the single-argument constructor on a write path would encrypt
|
||||
secrets with a key discarded at process exit, yielding permanently undecryptable
|
||||
ciphertext with no error (`ConfigurationDatabase-013`). The third is a minor
|
||||
inconsistency — a redundant cast on one of the three `HasConversion` calls
|
||||
(`ConfigurationDatabase-014`). The module is otherwise healthy and the prior fixes
|
||||
hold up well.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
@@ -44,11 +66,11 @@ repositories (`TemplateEngineRepository`, `DeploymentManagerRepository`,
|
||||
| 1 | Correctness & logic bugs | ✓ | `GetTemplateWithChildrenAsync` discards loaded children (CD-001); `GetApprovedKeysForMethodAsync` CSV parsing is brittle (CD-008). |
|
||||
| 2 | Akka.NET conventions | ✓ | No actors in this module; data-access layer only. No issues found. |
|
||||
| 3 | Concurrency & thread safety | ✓ | DbContext correctly scoped; optimistic concurrency on `DeploymentRecord` correct. Repositories hold no shared mutable state. No issues found. |
|
||||
| 4 | Error handling & resilience | ✓ | `WaitForDatabaseReadyAsync` is sound. No-arg DI overload fails late and silently (CD-003); audit JSON serialization failure handling (CD-007). |
|
||||
| 5 | Security | ✓ | Hardcoded `sa` credential literal (CD-002); SMTP/DB-connection/auth secrets stored unencrypted (CD-004). |
|
||||
| 6 | Performance & resource management | ✓ | `GetAllTemplatesAsync` / `GetTemplateTreeAsync` eager-load multiple collections without `AsSplitQuery` (CD-009). No N+1 in audited paths. |
|
||||
| 7 | Design-document adherence | ✓ | Audit `Id` type mismatch vs design doc (CD-005); seed data uses `HasData` consistent with design. |
|
||||
| 8 | Code organization & conventions | ✓ | Mostly clean. `Grpc*` address columns unbounded (CD-006); inconsistent null-guard on injected context (CD-011). |
|
||||
| 4 | Error handling & resilience | ✓ | `WaitForDatabaseReadyAsync` is sound. No-arg DI overload fails late and silently (CD-003, resolved); audit JSON serialization failure handling (CD-007, resolved). Re-review: ephemeral Data Protection fallback can silently produce undecryptable ciphertext (CD-013). |
|
||||
| 5 | Security | ✓ | Hardcoded `sa` credential literal (CD-002, resolved); SMTP/DB-connection/auth secrets unencrypted (CD-004, resolved). Re-review: `ApiKey.KeyValue` bearer credential still stored in plaintext (CD-012). |
|
||||
| 6 | Performance & resource management | ✓ | `GetAllTemplatesAsync` / `GetTemplateTreeAsync` eager-load multiple collections without `AsSplitQuery` (CD-009, resolved). No N+1 in audited paths. Re-review: no new issues. |
|
||||
| 7 | Design-document adherence | ✓ | Audit `Id` type mismatch vs design doc (CD-005, resolved); seed data uses `HasData` consistent with design. Re-review: no new issues. |
|
||||
| 8 | Code organization & conventions | ✓ | Mostly clean. `Grpc*` address columns unbounded (CD-006, resolved); inconsistent null-guard on injected context (CD-011, resolved). Re-review: redundant/inconsistent cast on one `HasConversion` call (CD-014). |
|
||||
| 9 | Testing coverage | ✓ | Several repositories and `InstanceLocator` lack direct tests (CD-010). |
|
||||
| 10 | Documentation & comments | ✓ | `DeploymentManagerRepository` "WP-24 stub" XML comment is stale; noted in module context but not raised as a standalone finding. No issues found beyond items above. |
|
||||
|
||||
@@ -570,3 +592,128 @@ every data-access type behaves uniformly and a hand-constructed instance fails w
|
||||
informative exception at construction rather than a later `NullReferenceException`.
|
||||
Regression: `Constructor_NullContext_Throws` tests were added for all four affected types
|
||||
(`InboundApiRepositoryTests.cs`, `RepositoryCoverageTests.cs`).
|
||||
|
||||
### ConfigurationDatabase-012 — Inbound-API `ApiKey.KeyValue` bearer credential stored in plaintext
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Security |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.ConfigurationDatabase/Configurations/InboundApiConfiguration.cs:17-19` |
|
||||
|
||||
**Description**
|
||||
|
||||
`ApiKey.KeyValue` is the bearer credential presented in the `X-API-Key` header to
|
||||
authenticate Inbound API requests (HighLevelReqs §7.2–7.3). It is mapped as an
|
||||
ordinary `nvarchar(500)` column with a unique index and persisted verbatim. Anyone
|
||||
with read access to the configuration database — or to any `AuditLogEntry.AfterStateJson`
|
||||
into which an `ApiKey` entity is serialized — obtains live API credentials in cleartext.
|
||||
|
||||
`ConfigurationDatabase-004` introduced encryption-at-rest for the other secret-bearing
|
||||
columns (SMTP credentials, external-system auth config, database connection strings)
|
||||
but explicitly scoped `ApiKey.KeyValue` out. The omission is genuine: the
|
||||
`EncryptedStringConverter` built for CD-004 is backed by ASP.NET Data Protection, which
|
||||
is **non-deterministic** — the same plaintext encrypts to different ciphertext each
|
||||
time — so it cannot be applied here, because `GetApprovedKeysForMethodAsync` and the
|
||||
authentication path resolve a key by its value (`GetApiKeyByValueAsync` does
|
||||
`FirstOrDefaultAsync(k => k.KeyValue == keyValue)`). A non-deterministic converter would
|
||||
break that equality lookup. The result is that the one credential most exposed to
|
||||
external callers is the one credential left unprotected.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Store a salted cryptographic hash of the key value instead of the plaintext (or
|
||||
ciphertext): hash on create, and authenticate by hashing the presented key and
|
||||
comparing. This keeps the equality lookup working (the hash is deterministic) while
|
||||
ensuring the database never holds a usable credential. The plaintext key would then
|
||||
only ever be shown once, at creation time, to the Admin who created it. This requires
|
||||
a coordinated change with the Inbound API / Security components and the `ApiKey`
|
||||
entity in Commons; record the chosen scheme in
|
||||
`docs/requirements/Component-ConfigurationDatabase.md` and the Inbound API design doc.
|
||||
At minimum, ensure `ApiKey` entities are never passed to `IAuditService` without the
|
||||
key value redacted.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### ConfigurationDatabase-013 — Secret-column encryption silently falls back to an ephemeral (throwaway) key
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.ConfigurationDatabase/ScadaLinkDbContext.cs:107-124` |
|
||||
|
||||
**Description**
|
||||
|
||||
`ApplySecretColumnEncryption` resolves the Data Protection provider as
|
||||
`_dataProtectionProvider ?? new EphemeralDataProtectionProvider()`. The `??` fallback
|
||||
is reached whenever the context is constructed via the single-argument
|
||||
`ScadaLinkDbContext(DbContextOptions)` constructor — i.e. whenever no provider was
|
||||
injected. An `EphemeralDataProtectionProvider` generates a key ring that lives only in
|
||||
process memory and is discarded at process exit.
|
||||
|
||||
For design-time `dotnet ef` tooling this is harmless (the XML remark correctly notes
|
||||
it only emits schema). The risk is on a *runtime write path*. The runtime currently
|
||||
gets the provider-bearing context only because `AddConfigurationDatabase` adds an
|
||||
`AddScoped` factory registration that overrides EF's activator-based registration.
|
||||
That override is the single thing standing between correct behaviour and silent data
|
||||
corruption: any future change that resolves a `ScadaLinkDbContext` through a path the
|
||||
override does not cover — an `AddPooledDbContextFactory`/`IDbContextFactory<ScadaLinkDbContext>`
|
||||
registration, a second `AddDbContext` call, a hand-constructed context in server code —
|
||||
would construct the context with the single-arg constructor, encrypt secret columns
|
||||
with a throwaway key, and persist ciphertext that becomes **permanently undecryptable
|
||||
the moment the process restarts**. There is no exception, no warning; the failure only
|
||||
surfaces later as `CryptographicException` on read (mis-attributed by
|
||||
`EncryptedStringConverter` to "the stored value was not written by this system").
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Do not silently substitute an ephemeral provider for write-capable contexts. Either:
|
||||
(a) require the provider unconditionally and have design-time tooling pass an explicit
|
||||
ephemeral provider so the intent is visible at the call site; or (b) keep the
|
||||
single-arg constructor but mark contexts built without a real provider as
|
||||
schema-only — e.g. record a flag and have the encrypting converter throw a clear
|
||||
`InvalidOperationException` ("secret columns cannot be written without a configured
|
||||
Data Protection key ring") on the first `Protect`, instead of producing throwaway
|
||||
ciphertext. Also harden the DI wiring so a `ScadaLinkDbContext` cannot be resolved
|
||||
through the EF-activator registration at all (e.g. register only the factory, or use
|
||||
`AddDbContextFactory` with the explicit constructor).
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### ConfigurationDatabase-014 — Redundant, inconsistent cast on one `HasConversion` call
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Code organization & conventions |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.ConfigurationDatabase/ScadaLinkDbContext.cs:121-123` |
|
||||
|
||||
**Description**
|
||||
|
||||
`ApplySecretColumnEncryption` calls `HasConversion(converter)` three times. The first
|
||||
two (`SmtpConfiguration.Credentials`, `ExternalSystemDefinition.AuthConfiguration`)
|
||||
pass `converter` directly; the third (`DatabaseConnectionDefinition.ConnectionString`)
|
||||
casts it to `(Microsoft.EntityFrameworkCore.Storage.ValueConversion.ValueConverter)`.
|
||||
`EncryptedStringConverter` already derives from `ValueConverter<string?, string?>`
|
||||
(itself a `ValueConverter`), and the first two call sites compile fine without the
|
||||
cast, so the cast is redundant. The inconsistency makes the code look as though the
|
||||
third call needs special handling when it does not, and the fully-qualified type name
|
||||
inline adds noise.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Remove the cast so all three calls read identically as `HasConversion(converter)`.
|
||||
If a `ValueConverter`-typed reference is genuinely wanted, give it a local variable of
|
||||
that type once and use it for all three.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.DataConnectionLayer` |
|
||||
| Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 4 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -30,20 +30,40 @@ the design doc's failover state machine and the implemented unstable-disconnect
|
||||
heuristic. Test coverage is adequate for the happy paths and failover but absent for
|
||||
tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
All 13 findings from the 2026-05-16 review remain `Resolved` and the fixes were
|
||||
verified in place against the current source (`PipeTo(Self)` subscribe pattern,
|
||||
`Resume` supervision, `ConcurrentDictionary` callback maps, atomic disconnect guards,
|
||||
bounded write timeout, etc.). The re-review walked all 10 checklist categories again
|
||||
and found **4 new findings**: one **High** — the DCL-012 security warning is never
|
||||
seen in production because `RealOpcUaClientFactory.Create()` constructs
|
||||
`RealOpcUaClient` with no logger, so the warning sinks into `NullLogger`; one
|
||||
**Medium** — initial-connect failures in the `Connecting` state never count toward
|
||||
failover, so a connection whose primary endpoint is unreachable at startup retries the
|
||||
primary forever and never tries the configured backup; one **Medium** —
|
||||
`HandleSubscribeCompleted` always replies `SubscribeTagsResponse(success: true)` even
|
||||
when a connection-level subscribe failure is driving the actor into `Reconnecting`,
|
||||
telling the Instance Actor the subscribe succeeded when it did not; and one **Low** —
|
||||
`WriteBatchAsync` does not catch the `InvalidOperationException` from `EnsureConnected`,
|
||||
so a mid-batch disconnect aborts the whole write batch (the same class of defect
|
||||
DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
|
||||
`DataConnectionLayer-014`.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | x | `_resolvedTags` double-counting and stale counters after failover; `ReadBatchAsync` aborts mid-batch. |
|
||||
| 2 | Akka.NET conventions | x | `Task.Run` mutating actor state (critical); `Restart` supervision loses state; closures capturing `_subscriptionsByInstance`. |
|
||||
| 3 | Concurrency & thread safety | x | Actor state mutated off the actor thread; `RealOpcUaClient` callback dictionary unsynchronized. |
|
||||
| 4 | Error handling & resilience | x | Subscription failures not surfaced; unbounded write with no timeout; reconnect after subscribe-time failure not handled. |
|
||||
| 5 | Security | x | `AutoAcceptUntrustedCerts` defaults to `true`; OPC UA password handling acceptable. See finding 012. |
|
||||
| 6 | Performance & resource management | x | `HandleUnsubscribe` O(n^2) over instances; initial-read loop serial per tag. |
|
||||
| 7 | Design-document adherence | x | Failover heuristic (unstable-disconnect count) differs from documented state machine; `WriteTimeout` documented but unused. |
|
||||
| 1 | Correctness & logic bugs | x | 2026-05-16 findings resolved. Re-review: finding 016 — `SubscribeTagsResponse` reports success on a connection-level subscribe failure. |
|
||||
| 2 | Akka.NET conventions | x | 2026-05-16 findings resolved (`PipeTo(Self)` subscribe, `Resume` supervision). Re-review: no new issues. |
|
||||
| 3 | Concurrency & thread safety | x | 2026-05-16 findings resolved (`ConcurrentDictionary`, atomic disconnect guards). Re-review: no new issues. |
|
||||
| 4 | Error handling & resilience | x | Re-review: finding 015 — initial-connect failures never trigger failover; finding 017 — `WriteBatchAsync` aborts on mid-batch disconnect. |
|
||||
| 5 | Security | x | Re-review: finding 014 — the DCL-012 auto-accept-cert warning is never logged in production (`RealOpcUaClient` built without a logger). |
|
||||
| 6 | Performance & resource management | x | 2026-05-16 finding 008 resolved (reverse index). Re-review: no new issues. |
|
||||
| 7 | Design-document adherence | x | 2026-05-16 findings 005/009 resolved. Re-review: no new issues (finding 015 logged under resilience). |
|
||||
| 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. |
|
||||
| 9 | Testing coverage | x | No tests for tag-resolution retry, disconnect/re-subscribe, bad-quality push, or `HandleSubscribe` concurrency. |
|
||||
| 10 | Documentation & comments | x | XML comment on `RaiseDisconnected` claims thread safety it does not have; design doc round-robin description stale. |
|
||||
| 9 | Testing coverage | x | DCL001–013 regression tests present. Re-review: gaps remain for finding 014/015/016 scenarios (no test for production logger wiring, startup failover, or subscribe-response-on-failure). |
|
||||
| 10 | Documentation & comments | x | 2026-05-16 finding 013 resolved. Re-review: no new issues. |
|
||||
|
||||
## Findings
|
||||
|
||||
@@ -661,3 +681,179 @@ fanning 32 barrier-synchronised threads that raise the client's `ConnectionLost`
|
||||
simultaneously, and asserts `Disconnected` fires exactly once per round; against a
|
||||
non-atomic check-then-set it double-fires (verified by temporarily reverting the
|
||||
guard), and it passes against the atomic fix.
|
||||
|
||||
### DataConnectionLayer-014 — DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Security |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:325`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:35-39,79-83` |
|
||||
|
||||
**Description**
|
||||
|
||||
Finding DataConnectionLayer-012 was resolved in part by adding a prominent
|
||||
`ILogger` warning in `RealOpcUaClient.ConnectAsync` whenever the auto-accept
|
||||
certificate validator is installed (`RealOpcUaClient.cs:79-83`). The
|
||||
`ILogger<RealOpcUaClient>` constructor parameter was made optional, defaulting to
|
||||
`NullLogger<RealOpcUaClient>.Instance` (`RealOpcUaClient.cs:35-39`).
|
||||
|
||||
However, the only production code path that constructs a `RealOpcUaClient` is
|
||||
`RealOpcUaClientFactory.Create()` (`RealOpcUaClient.cs:325`), which calls
|
||||
`new RealOpcUaClient(_globalOptions)` and passes **no logger**. The factory itself
|
||||
holds only an `OpcUaGlobalOptions` and has no `ILoggerFactory`/`ILogger` available.
|
||||
As a result the `_logger` field is always `NullLogger` for every real OPC UA
|
||||
connection, and the man-in-the-middle warning the DCL-012 fix added is silently
|
||||
discarded. An operator who deploys a connection with `AutoAcceptUntrustedCerts`
|
||||
enabled — accepting any server certificate on an industrial control link — gets no
|
||||
visible signal anywhere in the logs. The in-scope half of DCL-012's resolution is
|
||||
therefore not actually effective in production; only the unit test
|
||||
(`DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse`, which only
|
||||
checks the default value) passes.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Thread a real logger through to `RealOpcUaClient`. `DataConnectionFactory` already
|
||||
holds an `ILoggerFactory` and constructs `RealOpcUaClientFactory(globalOptions)` —
|
||||
give `RealOpcUaClientFactory` an `ILoggerFactory` (or an `ILogger<RealOpcUaClient>`)
|
||||
constructor parameter and pass `_loggerFactory.CreateLogger<RealOpcUaClient>()` into
|
||||
each `new RealOpcUaClient(...)`. Add a test that asserts the warning is emitted on a
|
||||
real connect with auto-accept enabled (e.g. via a captured `ILogger`), not just that
|
||||
the default is `false`.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### DataConnectionLayer-015 — Initial-connect failures never trigger failover; an unreachable primary at startup never tries the backup
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:404-417`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:419-493` |
|
||||
|
||||
**Description**
|
||||
|
||||
Failover between the primary and backup endpoints is implemented in two places, both
|
||||
reachable only after a connection has already been `Connected` at least once:
|
||||
`HandleReconnectResult` (Reconnecting state) counts `_consecutiveFailures` and switches
|
||||
endpoint, and `BecomeReconnecting` counts `_consecutiveUnstableDisconnects`.
|
||||
|
||||
`HandleConnectResult` — the handler for the *initial* connection attempt in the
|
||||
`Connecting` state (`DataConnectionActor.cs:404-417`) — does neither. On failure it
|
||||
only logs and re-arms the reconnect timer with `AttemptConnect`; it never increments
|
||||
`_consecutiveFailures`, never consults `_backupConfig`, and never switches endpoint.
|
||||
|
||||
Consequence: if the primary endpoint is unreachable when the connection actor first
|
||||
starts — which is the common case after a fresh artifact deployment, a site restart,
|
||||
or a primary that is simply down at that moment — the actor retries the *primary*
|
||||
endpoint indefinitely at `ReconnectInterval` and **never** attempts the configured
|
||||
backup. The design doc's endpoint-redundancy promise ("automatic failover when the
|
||||
active endpoint becomes unreachable") is silently not honoured for the
|
||||
never-connected-yet case, and an operator sees a connection stuck `Connecting` forever
|
||||
despite a healthy backup being configured.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Make `HandleConnectResult` participate in the failover counter the same way
|
||||
`HandleReconnectResult` does: increment `_consecutiveFailures` on failure and, when
|
||||
`_backupConfig != null && _consecutiveFailures >= _failoverRetryCount`, perform the
|
||||
endpoint switch (dispose adapter, create the other adapter, bump `_adapterGeneration`,
|
||||
log the failover event) before re-arming the timer. Alternatively, fold the initial
|
||||
connect into the same reconnect path so there is a single failover decision point. Add
|
||||
a regression test for "primary down at startup, backup configured → fails over to
|
||||
backup".
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### DataConnectionLayer-016 — `HandleSubscribeCompleted` reports `SubscribeTagsResponse` success even on a connection-level subscribe failure
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:606,666-672`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:232-240` |
|
||||
|
||||
**Description**
|
||||
|
||||
`HandleSubscribeCompleted` computes `connectionLevelFailure` (line 606) and returns it
|
||||
so the `Connected`-state handler can drive the actor into `Reconnecting`
|
||||
(`DataConnectionActor.cs:232-240`). But before returning, it unconditionally replies
|
||||
to the caller with `new SubscribeTagsResponse(..., true, null, ...)` (lines 666-667) —
|
||||
`Success: true`, `Error: null` — regardless of whether any tag failed at connection
|
||||
level.
|
||||
|
||||
So when a subscribe arrives while the adapter is silently down, the Instance Actor is
|
||||
told the subscribe **succeeded**, while the connection actor simultaneously transitions
|
||||
to `Reconnecting`. The tags were never actually subscribed at the adapter (the catch
|
||||
block recorded `Success: false`); they are recovered later by `ReSubscribeAll` only if
|
||||
and when reconnection succeeds. The caller has no way to distinguish "subscribed and
|
||||
healthy" from "accepted, but the connection is currently down" — a misleading
|
||||
success signal on a request that did not do what the response claims.
|
||||
|
||||
(Genuine tag-resolution failures are arguably also reported as overall `true`, but
|
||||
that is defensible: those tags are tracked in `_unresolvedTags` and the design models
|
||||
unresolved tags as a runtime quality concern, with a `Bad`-quality `TagValueUpdate`
|
||||
already pushed. The connection-level case is the clear defect because the actor itself
|
||||
treats it as a failure worth a state transition.)
|
||||
|
||||
**Recommendation**
|
||||
|
||||
When `connectionLevelFailure` is true, reply with
|
||||
`SubscribeTagsResponse(..., success: false, error: "connection unavailable — will
|
||||
re-subscribe on reconnect", ...)` (or an equivalent), so the caller's response matches
|
||||
the actor's own assessment. Optionally carry per-tag outcomes in the response so the
|
||||
Instance Actor can reflect partial success. Add a test asserting the response is not
|
||||
`Success: true` when a connection-level subscribe failure drives `Reconnecting`.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### DataConnectionLayer-017 — `WriteBatchAsync` aborts the whole batch on a mid-batch disconnect
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:229-237`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:218-227` |
|
||||
|
||||
**Description**
|
||||
|
||||
`WriteBatchAsync` loops calling `WriteAsync` per tag (`OpcUaDataConnection.cs:229-237`).
|
||||
`WriteAsync` returns a `WriteResult` for OPC-UA-level write rejections (good — a bad
|
||||
status does not abort the batch), but it first calls `EnsureConnected()`
|
||||
(`OpcUaDataConnection.cs:220`), which throws `InvalidOperationException` when the
|
||||
client is disconnected. `WriteBatchAsync` does not catch that exception, so if the
|
||||
connection drops partway through a batch the whole `WriteBatchAsync` throws and the
|
||||
caller gets no result map — losing the per-tag outcomes for the tags that already
|
||||
wrote. This is the same class of defect that DataConnectionLayer-007 fixed for
|
||||
`ReadBatchAsync` (which now records a failed `ReadResult` per failing tag and only
|
||||
propagates `OperationCanceledException`). `WriteBatchAsync` feeds
|
||||
`WriteBatchAndWaitAsync` (line 246), so a disconnect during a flag-and-wait write
|
||||
sequence surfaces as an unhandled exception rather than a clean `false`/per-tag result.
|
||||
|
||||
Severity is Low because device writes are real-time control operations with no
|
||||
store-and-forward, the batch write paths are not on the primary `HandleWrite` hot path
|
||||
(`HandleWrite` calls single-tag `WriteAsync`), and a disconnect mid-batch is itself an
|
||||
error condition — but the inconsistent error shape (exception vs. per-tag result) is a
|
||||
maintainability and correctness wart.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Mirror the DCL-007 fix: wrap the per-tag `WriteAsync` call in `WriteBatchAsync` in a
|
||||
try/catch that records a failed `WriteResult(false, ex.Message)` for the failing tag
|
||||
and continues, while still propagating `OperationCanceledException` to abort a
|
||||
cancelled batch as a whole. This gives callers (including `WriteBatchAndWaitAsync`) a
|
||||
complete, consistent result map.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.DeploymentManager` |
|
||||
| Design doc | `docs/requirements/Component-DeploymentManager.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 3 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -30,20 +30,43 @@ detail. Configuration is not bound to `appsettings.json`, leaving one option
|
||||
entirely dead. Test coverage stops at the communication boundary and never
|
||||
exercises a successful deployment or the lifecycle success paths.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
Re-reviewed at commit `39d737e` after the batch of fixes for
|
||||
DeploymentManager-001..014. All fourteen prior findings remain `Resolved` and
|
||||
verified against source — the broadened catch, non-cancellable cleanup writes,
|
||||
ref-counted `OperationLockManager`, query-before-redeploy reconciliation,
|
||||
structured diff, options binding, and the expanded TestKit-actor test suite are
|
||||
all present and correct. The module is in markedly better shape than the
|
||||
first review: error paths are now defensively handled and test coverage is
|
||||
broad (successful deploy/lifecycle, lock serialization, reconciliation
|
||||
matrix, artifact per-site matrix).
|
||||
|
||||
This re-review found **3 new findings**, all clustered on the
|
||||
DeploymentManager-006 reconciliation path added since the last review. The
|
||||
reconciliation shortcut (`TryReconcileWithSiteAsync`) marks a stale prior
|
||||
record `Success` when the site already has the target revision, but it does
|
||||
**not** perform the side effects the normal success path does — it never
|
||||
updates the instance `State`, never refreshes the `DeployedConfigSnapshot`,
|
||||
and never corrects the prior record's own `RevisionHash` (DeploymentManager-015,
|
||||
DeploymentManager-016). The `GetDeploymentStatusAsync` XML doc is now stale —
|
||||
it still describes the query-before-redeploy behaviour that actually moved into
|
||||
`TryReconcileWithSiteAsync` (DeploymentManager-017).
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ✓ | Stuck `InProgress` record on unexpected exception; cancelled-token failure write. |
|
||||
| 1 | Correctness & logic bugs | ✓ | Re-review 2026-05-17: reconciliation skips instance-state/snapshot updates (DeploymentManager-015) and keeps a stale `RevisionHash` (DeploymentManager-016). Prior: stuck `InProgress` / cancelled-token write (resolved). |
|
||||
| 2 | Akka.NET conventions | ✓ | Module is a plain service layer; it calls `CommunicationService` which wraps Ask. No actors here. No issues. |
|
||||
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` is sound but leaks semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. |
|
||||
| 4 | Error handling & resilience | ✓ | Several gaps — see DeploymentManager-001/002/003/004. |
|
||||
| 5 | Security | ✓ | SMTP credentials are serialized and broadcast to sites — see DeploymentManager-013. No injection vectors; no authz here (enforced upstream). |
|
||||
| 6 | Performance & resource management | ✓ | Semaphore leak (DeploymentManager-005); artifact rebuild does N+1 method queries per external system. |
|
||||
| 7 | Design-document adherence | ✓ | Missing query-before-redeploy (DeploymentManager-006); Diff View not implemented (DeploymentManager-007). |
|
||||
| 8 | Code organization & conventions | ✓ | Options class not bound to configuration — DeploymentManager-008. POCO/repo placement correct. |
|
||||
| 9 | Testing coverage | ✓ | No successful-deploy test, no lifecycle success test — DeploymentManager-011; dead `CreateCommand` helper — DeploymentManager-014. |
|
||||
| 10 | Documentation & comments | ✓ | Misleading timeout comment — DeploymentManager-009; stale option XML doc — DeploymentManager-012. |
|
||||
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counts and reclaims semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. No issues at re-review. |
|
||||
| 4 | Error handling & resilience | ✓ | Prior gaps DeploymentManager-001/002/003/004 resolved and verified. No new issues. |
|
||||
| 5 | Security | ✓ | SMTP credential handling documented as an accepted design decision (DeploymentManager-013). No injection vectors; no authz here (enforced upstream). No new issues. |
|
||||
| 6 | Performance & resource management | ✓ | Semaphore leak resolved (DeploymentManager-005). No new issues. |
|
||||
| 7 | Design-document adherence | ✓ | Query-before-redeploy and Diff View implemented (DeploymentManager-006/007). Re-review: reconciliation path breaks the deployed-snapshot/instance-state invariants — see DeploymentManager-015. |
|
||||
| 8 | Code organization & conventions | ✓ | Options binding resolved (DeploymentManager-008). POCO/repo placement correct. No new issues. |
|
||||
| 9 | Testing coverage | ✓ | Broad coverage added (success, lifecycle, lock serialization, reconciliation, artifact matrix). Re-review: reconciled-success path's missing side effects (DeploymentManager-015) are untested. |
|
||||
| 10 | Documentation & comments | ✓ | Prior comment findings resolved. Re-review: `GetDeploymentStatusAsync` XML doc is now stale — DeploymentManager-017. |
|
||||
|
||||
## Findings
|
||||
|
||||
@@ -710,3 +733,126 @@ the communication boundary. New tests:
|
||||
`DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId` (also
|
||||
covers DeploymentManager-010), `DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix`
|
||||
(per-site success/failure matrix), `RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits`.
|
||||
|
||||
### DeploymentManager-015 — Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:631-655` |
|
||||
|
||||
**Description**
|
||||
|
||||
`TryReconcileWithSiteAsync` (the DeploymentManager-006 query-before-redeploy
|
||||
path) handles the case where a prior `InProgress`/timeout-`Failed` record exists
|
||||
and the site reports it already has the target revision hash. In that case it
|
||||
marks the prior `DeploymentRecord` `Success`, audit-logs `DeployReconciled`, and
|
||||
returns it — the caller then returns `Result.Success` and **never enters the
|
||||
normal deploy body**.
|
||||
|
||||
The normal success path (`DeployInstanceAsync.cs:215-223`) does three things on
|
||||
a successful site response: writes the deployment record terminal status, sets
|
||||
`instance.State = InstanceState.Enabled` + `UpdateInstanceAsync`, and calls
|
||||
`StoreDeployedSnapshotAsync`. The reconciliation shortcut performs only the
|
||||
first. Consequently, after a reconciled deployment:
|
||||
|
||||
- The instance `State` is left at whatever it was (e.g. `NotDeployed` for a
|
||||
first-time deploy that timed out, or `Disabled`) even though the site is
|
||||
actually running the configuration — the central state machine and the site
|
||||
diverge, and a subsequent `DisableInstanceAsync`/`EnableInstanceAsync` will be
|
||||
rejected or allowed incorrectly by `StateTransitionValidator`.
|
||||
- No `DeployedConfigSnapshot` is created or refreshed. A first-time deploy that
|
||||
is resolved purely by reconciliation leaves `GetDeploymentComparisonAsync`
|
||||
permanently returning `"No deployed snapshot found for this instance."`, and a
|
||||
redeploy reconciliation leaves the stored snapshot showing the *old* config
|
||||
even though the deployment record claims `Success` for the new revision.
|
||||
|
||||
The design ("Deployed vs. Template-Derived State", WP-4/WP-8) requires the
|
||||
deployed snapshot and instance state to reflect the last successful deployment;
|
||||
the reconciliation path silently breaks both invariants.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
In the reconciled-success branch of `TryReconcileWithSiteAsync`, perform the
|
||||
same post-success side effects as the normal path: set `instance.State =
|
||||
InstanceState.Enabled` (+ `UpdateInstanceAsync`) and call
|
||||
`StoreDeployedSnapshotAsync` with the target deployment ID / revision hash /
|
||||
config JSON. Factor the shared post-success logic into one helper so the normal
|
||||
and reconciliation paths cannot drift. Add a regression test asserting that a
|
||||
reconciled deployment leaves the instance `Enabled` and a snapshot stored.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### DeploymentManager-016 — Reconciled prior record keeps its stale `RevisionHash`
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:639-651` |
|
||||
|
||||
**Description**
|
||||
|
||||
When `TryReconcileWithSiteAsync` reconciles a prior record, it mutates
|
||||
`prior.Status`, `prior.ErrorMessage`, and `prior.CompletedAt`, but **not**
|
||||
`prior.RevisionHash`. The reconciliation condition only compares the *site's*
|
||||
`AppliedRevisionHash` against the *freshly-flattened* `targetRevisionHash` — it
|
||||
does not require `prior.RevisionHash` to equal either of them.
|
||||
|
||||
The prior record can legitimately carry a different revision hash than the
|
||||
current target: e.g. a deploy timed out at revision `R1`, the template was then
|
||||
edited so the current flatten yields `R2`, and meanwhile the site actually
|
||||
applied `R2` through some other path (or `R1` and `R2` are equal-by-content but
|
||||
the prior record predates a hash recompute). After reconciliation the record's
|
||||
`Status` is `Success` but its `RevisionHash` still says `R1`, so staleness
|
||||
checks and any UI that reads `DeploymentRecord.RevisionHash` will report the
|
||||
instance as deployed at the wrong revision. The audit `DeployReconciled` entry
|
||||
records `RevisionHash = targetRevisionHash`, contradicting the persisted record.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
In the reconciled-success branch, also set `prior.RevisionHash =
|
||||
targetRevisionHash` so the persisted record, the audit entry, and the site's
|
||||
actual applied revision all agree. Alternatively, only reconcile when
|
||||
`prior.RevisionHash == targetRevisionHash` and otherwise fall through to a
|
||||
normal deploy.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### DeploymentManager-017 — `GetDeploymentStatusAsync` XML doc describes behaviour it does not implement
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Documentation & comments |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:562-570` |
|
||||
|
||||
**Description**
|
||||
|
||||
The XML summary on `GetDeploymentStatusAsync` reads: *"WP-2: After
|
||||
failover/timeout, query site for current deployment state before
|
||||
re-deploying."* The method body does no such thing — it is a one-line
|
||||
pass-through to `_repository.GetDeploymentByDeploymentIdAsync`, a pure local DB
|
||||
read. The query-the-site-before-redeploy behaviour the comment describes was
|
||||
implemented separately in `TryReconcileWithSiteAsync` (DeploymentManager-006).
|
||||
The stale comment is a leftover of the original design intent and misleads a
|
||||
reader into thinking this method contacts the site.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Reword the summary to describe what the method actually does — "returns the
|
||||
current persisted `DeploymentRecord` for the given deployment ID from the
|
||||
configuration database" — and, if useful, cross-reference
|
||||
`TryReconcileWithSiteAsync` as the place the site-query reconciliation lives.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.ExternalSystemGateway` |
|
||||
| Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 3 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -30,20 +30,41 @@ Test coverage is thin — `CachedCall` transient/buffering paths and `DatabaseGa
|
||||
are entirely untested. Themes: incomplete wiring against the S&F engine, and design-doc
|
||||
requirements (timeout, retry settings) that are declared but not implemented.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
All fourteen prior findings remain `Resolved`; the resolutions for findings 001–014
|
||||
were spot-checked against the current source and hold. The re-review walked the full
|
||||
10-category checklist again and surfaced **three new findings**. The most serious
|
||||
(`ExternalSystemGateway-015`, High) is a regression *introduced by* the
|
||||
`ExternalSystemGateway-004` resolution: `CachedCall`/`CachedWrite` now pass a
|
||||
per-system/per-connection `MaxRetries` of `0` through verbatim, but
|
||||
`StoreAndForwardService.RetryMessageAsync` interprets a stored `MaxRetries` of `0` as
|
||||
**retry forever**, not "never retry" — so the very `0` the ESG-004 fix claims to
|
||||
"honour as never retry" actually produces an unbounded retry loop, and two ESG tests
|
||||
assert the broken behaviour. `ExternalSystemGateway-016` (Medium) is that the
|
||||
`ExternalSystemGateway-013` resolution used `ConfigureHttpClientDefaults`, which is a
|
||||
**process-global** registration — it forces a `SocketsHttpHandler` (capped at the ESG
|
||||
option) onto every `HttpClient` in the host, including the Notification Service's
|
||||
OAuth2 token client, not just the gateway's per-system clients. `ExternalSystemGateway-017`
|
||||
(Low) is a trailing-`?` URL nit when a GET method's parameters are all null. Theme:
|
||||
both substantive findings are second-order defects in earlier fixes — the earlier
|
||||
resolutions did not verify the downstream contract of the S&F engine they integrate
|
||||
with.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ☑ | URL building edge cases, dropped S&F result, classification gaps — findings 003, 006, 009. |
|
||||
| 2 | Akka.NET conventions | ☑ | No actors in this module; `AddExternalSystemGatewayActors` is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found in this module. |
|
||||
| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; `ExternalCallResult.Response` lazy-parse is not thread-safe but instances are single-use. No findings raised. |
|
||||
| 4 | Error handling & resilience | ☑ | S&F handler never registered, double-dispatch, timeout not applied, cancellation conflation — findings 001, 002, 003, 008. |
|
||||
| 5 | Security | ☑ | Auth secrets logged-safe, but error bodies echoed verbatim — finding 007. |
|
||||
| 6 | Performance & resource management | ☑ | `HttpRequestMessage`/`HttpResponseMessage` and failed `SqlConnection` not disposed; full repository scan per call — findings 005, 010, 011. |
|
||||
| 7 | Design-document adherence | ☑ | Timeout, retry settings, audit logging gaps — findings 002, 004, 012. |
|
||||
| 8 | Code organization & conventions | ☑ | Options class correctly owned by module; `MaxConcurrentConnectionsPerSystem` unused — finding 013. |
|
||||
| 9 | Testing coverage | ☑ | CachedCall buffering and DatabaseGateway untested — finding 014. |
|
||||
| 10 | Documentation & comments | ☑ | XML docs reference WP numbers; permanent-failure logging requirement unverified — folded into finding 012. |
|
||||
| 1 | Correctness & logic bugs | ☑ | Prior: URL edge cases, dropped S&F result, classification — 003, 006, 009. Re-review: `MaxRetries == 0` retry-forever vs never-retry contradiction (015); trailing-`?` URL nit (017). |
|
||||
| 2 | Akka.NET conventions | ☑ | No actors in this module; `AddExternalSystemGatewayActors` is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found. |
|
||||
| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; the S&F delivery handlers resolve in a fresh DI scope on the sweep thread. No findings raised. |
|
||||
| 4 | Error handling & resilience | ☑ | Prior: handler registration, double-dispatch, timeout, cancellation — 001, 002, 003, 008. Re-review: the unbounded-retry consequence of finding 015 is also a resilience defect (recorded under category 1). |
|
||||
| 5 | Security | ☑ | Error bodies now truncated (007). No new issues — auth secrets not logged, body capped. |
|
||||
| 6 | Performance & resource management | ☑ | Prior: disposal and repository-scan findings 005, 010, 011 — all resolved and verified. No new issues. |
|
||||
| 7 | Design-document adherence | ☑ | Prior: timeout, retry settings, logging — 002, 004, 012. Re-review: finding 015 is also a design-adherence gap (S&F retry contract); recorded under category 1. |
|
||||
| 8 | Code organization & conventions | ☑ | Prior: `MaxConcurrentConnectionsPerSystem` wiring — 013. Re-review: that wiring uses process-global `ConfigureHttpClientDefaults`, leaking the ESG cap onto every host `HttpClient` — finding 016. |
|
||||
| 9 | Testing coverage | ☑ | Coverage is broad after finding 014. Re-review note: the `ZeroMaxRetries...` tests assert the persisted column, not the sweep outcome, and so lock in the finding-015 defect. |
|
||||
| 10 | Documentation & comments | ☑ | Inline comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101` assert a "never retry" semantic that the code does not deliver — see finding 015. |
|
||||
|
||||
## Findings
|
||||
|
||||
@@ -760,3 +781,144 @@ headers and body so URL/auth/body construction is now verified, not just status
|
||||
These are new-coverage tests against already-correct behaviour, so they pass on the
|
||||
current source; the `BuildUrl` and `ApplyAuth` paths they exercise are now protected
|
||||
against regression.
|
||||
|
||||
### ExternalSystemGateway-015 — `MaxRetries == 0` is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:120-127`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:102-108` |
|
||||
|
||||
**Description**
|
||||
|
||||
The `ExternalSystemGateway-004` fix removed the `MaxRetries > 0 ? ... : null` guard so
|
||||
that `CachedCallAsync` and `CachedWriteAsync` now pass the definition's `MaxRetries`
|
||||
to `StoreAndForwardService.EnqueueAsync` verbatim. The stated rationale (and the inline
|
||||
comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101`, plus the
|
||||
tests `CachedCall_TransientFailure_ZeroMaxRetriesIsHonouredNotTreatedAsUnset` and
|
||||
`CachedWrite_ZeroMaxRetriesIsHonouredNotTreatedAsUnset`) is that a configured
|
||||
`MaxRetries` of `0` means **"never retry"**.
|
||||
|
||||
That is the opposite of what the Store-and-Forward engine actually does with the
|
||||
value. `EnqueueAsync` stores the passed `maxRetries` directly into
|
||||
`StoreAndForwardMessage.MaxRetries` (`StoreAndForwardService.cs:139`), whose own XML
|
||||
doc states **"Maximum retry-sweep attempts before parking (0 = no limit)"**
|
||||
(`StoreAndForwardMessage.cs:30`). The retry sweep enforces it as
|
||||
`if (message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries)`
|
||||
(`StoreAndForwardService.cs:285`) — when `MaxRetries == 0` that guard is false on
|
||||
every sweep, so the message is **never parked and is retried forever**.
|
||||
|
||||
Consequences for a `CachedCall`/`CachedWrite` against a system or connection
|
||||
configured with `MaxRetries = 0`:
|
||||
|
||||
1. A transiently-failing message that the operator intended never to retry is instead
|
||||
retried on every sweep indefinitely, accumulating in the buffer and repeatedly
|
||||
re-dispatching the request — the exact unbounded-retry / duplicate-delivery hazard
|
||||
the idempotency note in the design doc warns about.
|
||||
2. The two ESG regression tests cited above assert `max_retries == 0` is *stored* and
|
||||
describe that as "honoured" — they verify the persisted column, never the resulting
|
||||
sweep behaviour, so they lock in the broken semantics.
|
||||
3. Because the SiteRuntime repository still always supplies `MaxRetries == 0` (the
|
||||
open companion gap noted in `ExternalSystemGateway-004`), **every** cached call and
|
||||
cached write at every site currently buffers as retry-forever. Before the ESG-004
|
||||
fix the `> 0` guard sent `null`, so the S&F `DefaultMaxRetries` (a bounded value)
|
||||
applied — i.e. the ESG-004 fix turned a bounded retry into an unbounded one for the
|
||||
common path.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Reconcile the ESG and S&F interpretations of `MaxRetries == 0` — they must agree.
|
||||
Either: (a) treat the entity's `MaxRetries == 0` as "unset" and pass `null` so the
|
||||
bounded S&F default applies (reverting to the pre-ESG-004 behaviour, and accepting
|
||||
that "never retry" then needs a different representation such as a nullable field or a
|
||||
`MaxRetries == 1` convention); or (b) if `0` genuinely must mean "never retry", add an
|
||||
explicit no-retry path — e.g. do not enqueue at all on transient failure when
|
||||
`MaxRetries == 0`, or introduce a distinct sentinel — and fix the
|
||||
`StoreAndForwardMessage.MaxRetries` doc and `RetryMessageAsync` guard so `0` no longer
|
||||
means "no limit". Update the two `ZeroMaxRetries...` tests to assert the *sweep*
|
||||
outcome (parked / not retried), not just the stored column value.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### ExternalSystemGateway-016 — `ConfigureHttpClientDefaults` applies the ESG connection cap to every `HttpClient` in the host process
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Code organization & conventions |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:21-29` |
|
||||
|
||||
**Description**
|
||||
|
||||
The `ExternalSystemGateway-013` fix wires `MaxConcurrentConnectionsPerSystem` by
|
||||
calling `services.ConfigureHttpClientDefaults(builder => builder.ConfigurePrimaryHttpMessageHandler(...))`.
|
||||
The inline comment claims this "applies to the dynamically-named clients created by
|
||||
`ExternalSystemClient`" — but `ConfigureHttpClientDefaults` is **process-global**: it
|
||||
adds the configuration to *every* `HttpClient`/`IHttpClientFactory` client created
|
||||
anywhere in the host, regardless of name.
|
||||
|
||||
The Host registers the External System Gateway alongside other components that also
|
||||
use `IHttpClientFactory` — notably `ScadaLink.NotificationService` (`OAuth2TokenService`
|
||||
and its `ServiceCollectionExtensions` call `AddHttpClient`). With the ESG registration
|
||||
present, the OAuth2 token client (and any future `HttpClient` consumer in the host)
|
||||
has its **primary handler replaced** by a `SocketsHttpHandler` whose
|
||||
`MaxConnectionsPerServer` is the ESG's `MaxConcurrentConnectionsPerSystem`. That:
|
||||
|
||||
1. Silently caps unrelated clients at a value owned by, and named for, a different
|
||||
component — an operator tuning the ESG option unknowingly throttles Microsoft 365
|
||||
OAuth2 token acquisition.
|
||||
2. Overrides/discards any primary-handler configuration those other components add for
|
||||
their own clients (e.g. a custom handler, proxy, or certificate settings).
|
||||
|
||||
This is a leaky, surprising side effect for what the option claims to be a per-ESG
|
||||
setting; `ConfigureHttpClientDefaults` should not be used to express a single
|
||||
component's policy.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Scope the handler configuration to the gateway's own clients only. The ESG already
|
||||
creates per-system clients with a deterministic name pattern
|
||||
(`ExternalSystem_{system.Name}`); register a typed/named `HttpClient` (or a small
|
||||
factory abstraction) for that pattern and call `ConfigurePrimaryHttpMessageHandler`
|
||||
on that registration instead of on the global defaults. If a name-pattern registration
|
||||
is impractical, document the global effect explicitly and rename the option, but the
|
||||
preferred fix is to stop using `ConfigureHttpClientDefaults`.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### ExternalSystemGateway-017 — `BuildUrl` appends a bare trailing `?` when a GET method's parameters are all null
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:324-333` |
|
||||
|
||||
**Description**
|
||||
|
||||
In `BuildUrl`, the GET/DELETE query-string branch is entered when
|
||||
`parameters != null && parameters.Count > 0`, but the projection then filters out
|
||||
null-valued entries (`parameters.Where(p => p.Value != null)`). When a GET/DELETE
|
||||
method is invoked with a non-empty parameter dictionary whose values are *all* null,
|
||||
`queryString` is the empty string and the code still executes `url += "?" + queryString`,
|
||||
producing a URL ending in a bare `?` (e.g. `https://host/api/resource?`). Most servers
|
||||
tolerate a trailing `?`, but it is an unintended artifact, can defeat response caching
|
||||
on some endpoints, and makes captured request URLs harder to read in logs.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Only append `"?" + queryString` when `queryString` is non-empty (compute the joined
|
||||
string first and check it), so a method whose effective parameter set is empty
|
||||
produces a clean URL identical to the no-parameters case.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.HealthMonitoring` |
|
||||
| Design doc | `docs/requirements/Component-HealthMonitoring.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 4 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -32,20 +32,38 @@ heartbeat path, and most collector setters. None of the findings are crash-class
|
||||
but the concurrency issues are Medium/High and the missing S&F metric is a real
|
||||
design-adherence gap.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
All twelve prior findings (HealthMonitoring-001..012) are confirmed `Resolved` —
|
||||
`SiteHealthState` is now an immutable `sealed record` mutated only via atomic
|
||||
compare-and-swap, the store-and-forward buffer-depth metric is populated, the
|
||||
central-site offline grace and the unknown-site heartbeat registration are in
|
||||
place, and the test suite has grown to cover the report loop, heartbeat path, and
|
||||
collector setters. This re-review found **4 new findings, all Low/Medium, none
|
||||
crash-class**. They are residual polish items rather than behaviour regressions:
|
||||
an inaccurate offline-check-interval comment (HealthMonitoring-013), unvalidated
|
||||
`HealthMonitoringOptions` intervals that crash the hosted service on
|
||||
misconfiguration (HealthMonitoring-014), a heartbeat-only registered site left
|
||||
with a year-0001 `LastReportReceivedAt` that the UI's staleness display must
|
||||
special-case (HealthMonitoring-015), and `CollectReport` reading
|
||||
`DateTimeOffset.UtcNow` directly instead of the module's now-standard injected
|
||||
`TimeProvider` (HealthMonitoring-016). The module remains small, readable, and
|
||||
broadly faithful to the design intent.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). |
|
||||
| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). Re-review: heartbeat-registered site left with year-0001 `LastReportReceivedAt` (HealthMonitoring-015). |
|
||||
| 2 | Akka.NET conventions | x | Module itself contains no actors (transport abstracted via `IHealthReportTransport`); `AddHealthMonitoringActors` is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope. |
|
||||
| 3 | Concurrency & thread safety | x | Unguarded mutable `SiteHealthState` (HealthMonitoring-002); mutation inside `AddOrUpdate` delegate (HealthMonitoring-003); `GetAllSiteStates` leaks live mutable references (HealthMonitoring-008). Collector counters correctly use `Interlocked`. |
|
||||
| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010); top-level loop error handling is sound. |
|
||||
| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010, resolved); top-level loop error handling is sound. Re-review: `HealthMonitoringOptions` intervals unvalidated — a zero/negative value crashes the hosted service at `PeriodicTimer` construction (HealthMonitoring-014). |
|
||||
| 5 | Security | x | No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface. |
|
||||
| 6 | Performance & resource management | x | `PeriodicTimer` instances correctly disposed via `using`. Dictionary snapshots per report are acceptable at the documented scale. No issues found. |
|
||||
| 7 | Design-document adherence | x | Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006). |
|
||||
| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011). |
|
||||
| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011, resolved). Re-review: `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of the module's now-standard injected `TimeProvider` (HealthMonitoring-016). |
|
||||
| 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
|
||||
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012). |
|
||||
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses `OfflineTimeout` (HealthMonitoring-013). |
|
||||
|
||||
## Findings
|
||||
|
||||
@@ -560,3 +578,148 @@ has not yet sent a report"). A codebase-wide search confirms no `null!` suppress
|
||||
remains anywhere in `src/ScadaLink.HealthMonitoring`. This is exactly the change
|
||||
HealthMonitoring-002 made when converting `SiteHealthState` to an immutable record, so
|
||||
the contract is now honest and no further code change was required.
|
||||
|
||||
### HealthMonitoring-013 — Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout`
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Documentation & comments |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:194-196` |
|
||||
|
||||
**Description**
|
||||
|
||||
`ExecuteAsync` derives the `PeriodicTimer` cadence with the comment "Check at half
|
||||
the (shorter) offline timeout interval for timely detection", but the code only
|
||||
reads `_options.OfflineTimeout`:
|
||||
|
||||
```csharp
|
||||
var checkInterval = TimeSpan.FromMilliseconds(_options.OfflineTimeout.TotalMilliseconds / 2);
|
||||
```
|
||||
|
||||
`CentralOfflineTimeout` (HealthMonitoring-005's fix) is never considered. With the
|
||||
default options (`OfflineTimeout` 60s, `CentralOfflineTimeout` 3m) `OfflineTimeout`
|
||||
genuinely is the shorter of the two, so the parenthetical happens to hold. But the
|
||||
comment states an invariant the code does not enforce: if an operator configures
|
||||
`CentralOfflineTimeout` *smaller* than `OfflineTimeout`, the check cadence stays
|
||||
tied to `OfflineTimeout`, and central offline detection is delayed by up to a full
|
||||
`OfflineTimeout / 2` beyond the intended `CentralOfflineTimeout` window. The comment
|
||||
misleads a reader into believing the cadence already adapts to whichever timeout is
|
||||
shorter.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either compute `checkInterval` from `Math.Min(OfflineTimeout, CentralOfflineTimeout)`
|
||||
so the code matches the comment, or drop the "(shorter)" wording and state plainly
|
||||
that the cadence is derived from `OfflineTimeout` only (acceptable while the default
|
||||
`CentralOfflineTimeout` is the larger value).
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### HealthMonitoring-014 — `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.HealthMonitoring/HealthMonitoringOptions.cs:3-20`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:196`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:67`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:63` |
|
||||
|
||||
**Description**
|
||||
|
||||
`HealthMonitoringOptions` is bound from the `ScadaLink:HealthMonitoring` config
|
||||
section (`SiteServiceRegistration.BindSharedOptions`) with no validation —
|
||||
no `IValidateOptions<HealthMonitoringOptions>`, no `ValidateDataAnnotations`, no
|
||||
`ValidateOnStart`. `ReportInterval`, `OfflineTimeout`, and `CentralOfflineTimeout`
|
||||
are all fed straight into `new PeriodicTimer(...)` (and `OfflineTimeout` into a
|
||||
division for the check interval). `PeriodicTimer`'s constructor throws
|
||||
`ArgumentOutOfRangeException` for a zero or negative period. A misconfigured
|
||||
`appsettings.json` (e.g. `"ReportInterval": "00:00:00"`, an empty/garbled value
|
||||
that binds to `TimeSpan.Zero`, or a negative span) therefore crashes the
|
||||
`HealthReportSender` / `CentralHealthReportLoop` / `CentralHealthAggregator`
|
||||
hosted service at startup with an opaque exception that does not name the
|
||||
offending config key, rather than failing fast with a clear validation message.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add an options validator (DataAnnotations `[Range]`-style on the spans, or an
|
||||
`IValidateOptions<HealthMonitoringOptions>`) that rejects non-positive
|
||||
`ReportInterval`/`OfflineTimeout`/`CentralOfflineTimeout` and ideally requires
|
||||
`CentralOfflineTimeout >= OfflineTimeout`, and call `.ValidateOnStart()` so a bad
|
||||
configuration fails fast with a message naming the section and key.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### HealthMonitoring-015 — Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt`
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:122-130`, `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:27` |
|
||||
|
||||
**Description**
|
||||
|
||||
When `MarkHeartbeat` registers a previously-unknown site (HealthMonitoring-007's
|
||||
fix), it sets `LastReportReceivedAt = default` — i.e. `DateTimeOffset.MinValue`
|
||||
(`0001-01-01`). The XML doc on `SiteHealthState.LastReportReceivedAt` states the
|
||||
field is "Used by the UI to surface report staleness during failover." A
|
||||
heartbeat-only site therefore has `LatestReport == null` **and**
|
||||
`LastReportReceivedAt == DateTimeOffset.MinValue`. Any UI code that computes
|
||||
"last report N ago" as `now - LastReportReceivedAt` without first checking
|
||||
`LatestReport != null` will render a nonsensical staleness of roughly two
|
||||
thousand years for a site that is, in fact, freshly reachable. The two
|
||||
"no report yet" signals (`LatestReport == null`, `LastReportReceivedAt == default`)
|
||||
are independent and both must be special-cased; the sentinel value is an easy trap.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Make `LastReportReceivedAt` nullable (`DateTimeOffset?`) so "no report received
|
||||
yet" is an explicit, unmissable state rather than a magic sentinel — consistent
|
||||
with how `LatestReport` was already made nullable for the same case — and have UI
|
||||
consumers render staleness only when it has a value. Alternatively, document the
|
||||
`default` sentinel prominently on the field and audit every UI reader, but the
|
||||
nullable option is safer and matches the existing `LatestReport` treatment.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### HealthMonitoring-016 — `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider`
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Code organization & conventions |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:151` |
|
||||
|
||||
**Description**
|
||||
|
||||
`CollectReport` stamps each report with `ReportTimestamp: DateTimeOffset.UtcNow`,
|
||||
read directly from the system clock. Every other time-dependent class in the
|
||||
module — `CentralHealthAggregator`, `HealthReportSender`, `CentralHealthReportLoop`
|
||||
— was deliberately refactored (HealthMonitoring-006) to take an injectable
|
||||
`TimeProvider` so the behaviour is deterministically testable and the clock
|
||||
dependency is explicit. `SiteHealthCollector` is the lone holdout: the report
|
||||
timestamp cannot be controlled in a unit test, which is why
|
||||
`SiteHealthCollectorTests.CollectReport_IncludesUtcTimestamp` can only assert the
|
||||
timestamp falls in a `before`/`after` wall-clock window rather than equalling a
|
||||
known instant. This is a minor consistency/testability gap, not a behaviour bug.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add an optional `TimeProvider` constructor parameter to `SiteHealthCollector`
|
||||
(defaulting to `TimeProvider.System`, mirroring the other classes) and derive
|
||||
`ReportTimestamp` from `GetUtcNow()`, so the report timestamp is deterministically
|
||||
testable and the module is consistent.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
+180
-13
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.Host` |
|
||||
| Design doc | `docs/requirements/Component-Host.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 4 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -31,20 +31,37 @@ and unguarded string interpolation when building HOCON. None are crash/data-loss
|
||||
class, but the readiness bug is High because it breaks load-balancer behaviour with
|
||||
no safe workaround.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
All eleven findings from the first review (Host-001..011) are confirmed `Resolved`
|
||||
in the current tree — the `/health/ready` predicate, the externalised secrets, the
|
||||
seed-node/GrpcPort validation rules, the escaped HOCON builder, the bounded
|
||||
migration retry and the live `LoggingOptions.MinimumLevel` are all present as
|
||||
described in their Resolution notes. This re-review walked all ten checklist
|
||||
categories again over the full module and recorded four new findings, none of them
|
||||
crash/data-loss class. Following up on a batch-1 ClusterInfrastructure re-review
|
||||
note, Host-012 (Medium) confirms `AkkaHostedService.BuildHocon` hard-codes
|
||||
`down-if-alone = on` and never reads the `ClusterOptions.DownIfAlone` property, so
|
||||
that documented, bound option is dead. The remaining three are Low: `:F0` rounding
|
||||
of cluster timing values silently degrades any sub-second configuration (Host-013),
|
||||
Serilog sink setup is hard-coded in `Program.cs` rather than configuration-driven as
|
||||
REQ-HOST-8 requires (Host-014), and `StartupRetry` retries indiscriminately on every
|
||||
exception type including permanent schema-validation failures (Host-015).
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ☑ | `/health/ready` includes the leader-only check (Host-001); site seed-node config points at the gRPC port (Host-004). |
|
||||
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. HOCON built by raw string interpolation (Host-006); `StartAsync` returns before actors are confirmed running (Host-009). |
|
||||
| 3 | Concurrency & thread safety | ☑ | Blocking `GetAwaiter().GetResult()` on a hosted-service startup thread (Host-005). `DeadLetterMonitorActor` state is actor-confined — no issues. |
|
||||
| 4 | Error handling & resilience | ☑ | Top-level try/catch logs fatal and rethrows. No retry around DB migration / readiness preconditions (Host-010). |
|
||||
| 5 | Security | ☑ | Plaintext DB password, LDAP service-account password and dev JWT key checked into `appsettings.Central.json` (Host-003). |
|
||||
| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. |
|
||||
| 7 | Design-document adherence | ☑ | REQ-HOST-6 mandates Akka.Persistence config but none exists and no persistent actors exist — doc is stale (Host-002). REQ-HOST-4 GrpcPort-≠-RemotingPort rule not enforced (Host-007). |
|
||||
| 8 | Code organization & conventions | ☑ | `MachineDataDb` validated/declared but never consumed (Host-008). `LoggingOptions.MinimumLevel` is dead (Host-011). |
|
||||
| 9 | Testing coverage | ☑ | Strong suite; no test asserts `/health/ready` excludes `active-node`, which is why Host-001 slipped through (noted in Host-001). |
|
||||
| 10 | Documentation & comments | ☑ | Comments are accurate. REQ-HOST-6 in the design doc is the main stale-doc item (Host-002). |
|
||||
| 1 | Correctness & logic bugs | ☑ | Host-001/004 resolved. Re-review: `:F0` rounding of cluster timing values silently distorts sub-second durations (Host-013). |
|
||||
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. Host-006/009 resolved; no new issues. |
|
||||
| 3 | Concurrency & thread safety | ☑ | Host-005 resolved (`StartAsync` now genuinely async). `DeadLetterMonitorActor` state is actor-confined — no issues. |
|
||||
| 4 | Error handling & resilience | ☑ | Host-010 resolved. Re-review: `StartupRetry` retries indiscriminately on permanent failures (e.g. schema-validation mismatch) (Host-015). |
|
||||
| 5 | Security | ☑ | Host-003 resolved — secrets externalised to env vars. No new issues. |
|
||||
| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. No new issues. |
|
||||
| 7 | Design-document adherence | ☑ | Host-002/007 resolved. Re-review: `ClusterOptions.DownIfAlone` bound/documented but never consumed — HOCON hard-codes `on` (Host-012); Serilog sinks hard-coded, not config-driven per REQ-HOST-8 (Host-014). |
|
||||
| 8 | Code organization & conventions | ☑ | Host-008/011 resolved. No new issues. |
|
||||
| 9 | Testing coverage | ☑ | Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new `down-if-alone`, sub-second-duration, or non-transient-retry paths (Host-012/013/015). |
|
||||
| 10 | Documentation & comments | ☑ | REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014). |
|
||||
|
||||
## Findings
|
||||
|
||||
@@ -561,3 +578,153 @@ Regression tests in new `LoggerConfigurationTests`:
|
||||
— they assert the configured level actually filters log events; they pass against the
|
||||
new factory (and would fail against the old inline configuration, which ignored the
|
||||
key). Full Host suite green (175 passed).
|
||||
|
||||
### Host-012 — `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:146-148` |
|
||||
|
||||
**Description**
|
||||
|
||||
`AkkaHostedService.BuildHocon` emits the split-brain-resolver block with
|
||||
`keep-oldest { down-if-alone = on }` as a literal constant. `ClusterOptions`
|
||||
(`src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:74`) exposes a
|
||||
`DownIfAlone` property — bound from `ScadaLink:Cluster` via the Options pattern,
|
||||
documented as "the design-doc requirement", default `true` — but a repo-wide search
|
||||
shows it is referenced **nowhere outside its own declaration**. The Host therefore
|
||||
ignores the bound value entirely: setting `ScadaLink:Cluster:DownIfAlone` to `false`
|
||||
in `appsettings.json` has no effect, the resolver still runs with `down-if-alone =
|
||||
on`. This is the same class of defect as the resolved Host-011
|
||||
(`LoggingOptions.MinimumLevel` was dead config) — a configuration option that is
|
||||
declared, bound, and documented but never consumed, which silently misleads any
|
||||
operator who edits it. A batch-1 re-review of ClusterInfrastructure flagged the same
|
||||
hard-coding; it is recorded here because `BuildHocon` is Host-owned code (the
|
||||
ClusterInfrastructure project owns only the configuration contract, per the
|
||||
`ClusterOptions` class comment).
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Make `BuildHocon` consume `clusterOptions.DownIfAlone` — emit `down-if-alone =
|
||||
{(clusterOptions.DownIfAlone ? "on" : "off")}` (the value is a bool, so no escaping
|
||||
is needed). Add a `HoconBuilderTests` case asserting both `true` and `false` produce
|
||||
the corresponding `down-if-alone` token. If the flag is genuinely meant to be fixed
|
||||
at `on` for ScadaLink's two-node clusters, remove the `DownIfAlone` property and its
|
||||
doc comment instead so code and configuration contract agree — but do not leave it
|
||||
declared-but-dead.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Host-013 — `:F0` rounding of cluster timing values silently degrades sub-second configuration
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:135-136,145,151-152` |
|
||||
|
||||
**Description**
|
||||
|
||||
`BuildHocon` renders every duration into HOCON via `TotalSeconds:F0` —
|
||||
`transportHeartbeatSec:F0`, `transportFailureSec:F0`, `StableAfter.TotalSeconds:F0`,
|
||||
`HeartbeatInterval.TotalSeconds:F0`, `FailureDetectionThreshold.TotalSeconds:F0`. The
|
||||
`F0` format specifier rounds to whole seconds, so any sub-second configuration is
|
||||
silently distorted: a `HeartbeatInterval` of `00:00:00.5` renders as
|
||||
`heartbeat-interval = 0s` (a degenerate / invalid value that Akka will reject or
|
||||
treat as zero), and `00:00:02.6` becomes `3s`. The shipped defaults are all whole
|
||||
seconds so this is latent, but the configuration model accepts arbitrary `TimeSpan`
|
||||
values and an operator tuning a heartbeat to e.g. `750ms` would get a `1s` value with
|
||||
no warning — or, worse, `0s`. Rounding configured timing values without surfacing
|
||||
the change is a correctness hazard for exactly the kind of failure-detection tuning
|
||||
these options exist for.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Render durations without precision loss — emit milliseconds (e.g.
|
||||
`heartbeat-interval = {ts.TotalMilliseconds:F0}ms`) so sub-second values survive, or
|
||||
validate in `StartupValidator` that each cluster timing value is a positive whole
|
||||
number of seconds and fail fast otherwise. Either way, do not silently round.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Host-014 — Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8)
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Host/Program.cs:43-48`, `src/ScadaLink.Host/appsettings.json:1-7` |
|
||||
|
||||
**Description**
|
||||
|
||||
REQ-HOST-8 requires the Host to configure Serilog with "Configuration-driven sink
|
||||
setup (console and file sinks at minimum)". `LoggerConfigurationFactory.Build` calls
|
||||
`.ReadFrom.Configuration(configuration)`, which reads the standard `Serilog`
|
||||
configuration section — but neither `appsettings.json` nor either role-specific file
|
||||
contains a `Serilog` section (only a Microsoft `Logging` section with a `LogLevel`
|
||||
map, which Serilog's `ReadFrom.Configuration` does not consume). The two sinks that
|
||||
actually run are appended in `Program.cs` as hard-coded `.WriteTo.Console(...)` and
|
||||
`.WriteTo.File("logs/scadalink-.log", rollingInterval: Day)` calls. As a result the
|
||||
console output template, the file path, and the rolling interval cannot be changed
|
||||
without recompiling — an operator cannot redirect logs, change the file location, or
|
||||
add a sink via configuration, contrary to REQ-HOST-8. The `ReadFrom.Configuration`
|
||||
call is effectively a no-op because the section it reads does not exist.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Move the console and file sink definitions into a `Serilog` section in
|
||||
`appsettings.json` (with `WriteTo` entries and the output template / file path /
|
||||
rolling interval as arguments) so `ReadFrom.Configuration` drives them, and drop the
|
||||
hard-coded `.WriteTo` calls from `Program.cs`. Alternatively, update REQ-HOST-8 to
|
||||
state the sinks are intentionally code-defined — but the design doc currently says
|
||||
"configuration-driven", so code and doc must be reconciled.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Host-015 — `StartupRetry` retries on every exception type, including permanent failures
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Host/StartupRetry.cs:36-45` |
|
||||
|
||||
**Description**
|
||||
|
||||
`StartupRetry.ExecuteWithRetryAsync` catches `Exception` with only the guard
|
||||
`when (attempt < maxAttempts)` — it retries on *any* exception type. Its sole caller
|
||||
wraps `MigrationHelper.ApplyOrValidateMigrationsAsync` (`Program.cs:124-134`), which
|
||||
on a Central node in production *validates* the schema version and throws if the
|
||||
deployed schema does not match the expected migration set. A schema-version mismatch
|
||||
is a permanent error — retrying it cannot succeed — yet `StartupRetry` will retry it
|
||||
8 times with capped exponential backoff (2s, 4s, 8s, 16s, 30s, 30s, 30s ≈ 2 minutes)
|
||||
before finally rethrowing, delaying the fatal-exit-and-alert by minutes for a fault
|
||||
that is fatal on the first attempt. The retry helper is meant for *transient*
|
||||
connection faults (the XML doc says exactly that: "the database may be briefly
|
||||
unreachable"), but it cannot distinguish transient from permanent failures.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Restrict retries to transient faults — e.g. accept an `Func<Exception, bool>
|
||||
isTransient` predicate and, for the migration call site, treat only
|
||||
connection-class exceptions (`SqlException` with a connection/transport error
|
||||
number, `TimeoutException`, socket errors) as retryable while letting
|
||||
schema-validation / `InvalidOperationException` failures propagate immediately. Add
|
||||
a `StartupRetryTests` case asserting a non-transient exception is rethrown after a
|
||||
single attempt.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.InboundAPI` |
|
||||
| Design doc | `docs/requirements/Component-InboundAPI.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 4 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -30,6 +30,25 @@ well but there is no coverage of the HTTP endpoint, concurrency, or recompilatio
|
||||
None of the findings are data-loss-class, but the concurrency and trust-model issues
|
||||
are High severity and should be addressed before production use.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
All 13 findings from the initial review remain `Resolved`; the module source under
|
||||
`src/ScadaLink.InboundAPI` is unchanged since the last InboundAPI fix commit
|
||||
(`8dd7412`), which precedes `39d737e`. This re-review re-walked all 10 checklist
|
||||
categories against the resolved code and surfaced **4 new findings** — none touching
|
||||
the previously-fixed concurrency/trust-model code, but all in areas the first pass
|
||||
did not probe deeply: (1) the `ReturnDefinition` column is loaded onto `ApiMethod`
|
||||
but is never consulted — script return values are serialized verbatim with no shaping
|
||||
or validation against the declared return structure (InboundAPI-014); (2) the new
|
||||
`ForbiddenApiChecker` is a purely textual syntax walker and can be bypassed by
|
||||
reaching forbidden functionality through member access that never spells a forbidden
|
||||
namespace, e.g. `typeof(x).Assembly.GetType("System.IO.File")` (InboundAPI-015);
|
||||
(3) routed `Route.To().Call()` invocations are not bound by the method timeout unless
|
||||
the script explicitly threads `Parameters`-side cancellation, contradicting the design
|
||||
statement that the timeout covers routed calls (InboundAPI-016); and (4) `RouteHelper`
|
||||
/ `RouteTarget` — the entire WP-4 cross-site routing surface — has no test coverage
|
||||
(InboundAPI-017). New findings are one Medium-trio plus one Low; no Critical or High.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
@@ -38,11 +57,11 @@ are High severity and should be addressed before production use.
|
||||
| 2 | Akka.NET conventions | ☑ | Module is ASP.NET-hosted, no actors of its own; routes to actors via `CommunicationService`. No correlation-ID issues — IDs are set in `RouteHelper`. |
|
||||
| 3 | Concurrency & thread safety | ☑ | Singleton `InboundScriptExecutor` mutates a non-thread-safe `Dictionary` from concurrent request threads — see InboundAPI-001/002. |
|
||||
| 4 | Error handling & resilience | ☑ | Catch-all conflates client cancellation with timeout (InboundAPI-004); compilation-failure path repeats work on every request (InboundAPI-009). |
|
||||
| 5 | Security | ☑ | Non-constant-time key comparison, no trust-model enforcement, no body-size limit, missing-method enumeration oracle — see InboundAPI-003/005/006/011. |
|
||||
| 5 | Security | ☑ | Prior items resolved. Re-review: `ForbiddenApiChecker` is a textual deny-list bypassable via reflection without a forbidden namespace token (InboundAPI-015). |
|
||||
| 6 | Performance & resource management | ☑ | Up to 3 separate DB round-trips per request in `ApiKeyValidator`; uncapped lazy recompilation. |
|
||||
| 7 | Design-document adherence | ☑ | `Database.Connection()` script API missing; central-only hosting not enforced; lazy-compile diverges from "compiled at startup". |
|
||||
| 8 | Code organization & conventions | ☑ | `ParameterDefinition` is an API-shaped POCO declared in the component project rather than Commons; otherwise conventions followed. |
|
||||
| 9 | Testing coverage | ☑ | Good unit coverage of the two validators; no endpoint, concurrency, recompilation, or timeout-vs-cancel tests. |
|
||||
| 7 | Design-document adherence | ☑ | Re-review: `ReturnDefinition` loaded but never used (InboundAPI-014); routed-call timeout not enforced (InboundAPI-016). Prior `Database.Connection()`/central-only items resolved. |
|
||||
| 8 | Code organization & conventions | ☑ | `ParameterDefinition` moved to Commons (InboundAPI-012 resolved); no new issues. |
|
||||
| 9 | Testing coverage | ☑ | Re-review: `RouteHelper`/`RouteTarget` (WP-4 routing) entirely untested (InboundAPI-017); validators/executor/filter well covered. |
|
||||
| 10 | Documentation & comments | ☑ | `ApiKeyValidationResult.NotFound` XML/name says "NotFound" but returns HTTP 400 — misleading (InboundAPI-013). |
|
||||
|
||||
## Findings
|
||||
@@ -580,3 +599,181 @@ the new method-not-found status, and removing dead code cannot regress. Doc-owne
|
||||
follow-up: `Component-InboundAPI.md`'s Error Handling section still does not list a
|
||||
"method not found" status; it should note that it is reported as 403 (indistinguishable
|
||||
from "key not approved"), but that doc edit is outside this module's editable scope.
|
||||
|
||||
### InboundAPI-014 — `ReturnDefinition` is loaded but never used; script return value is unshaped/unvalidated
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:201-205`, `src/ScadaLink.Commons/Entities/InboundApi/ApiMethod.cs:10` |
|
||||
|
||||
**Description**
|
||||
|
||||
`Component-InboundAPI.md` ("API Method Definition → Return Value Definition" and the
|
||||
"Response Format" section) specifies that each method has a declared return structure
|
||||
— "Field names and data types … Supports returning lists of objects" — and that the
|
||||
success response body is "the method's return value as JSON, with fields matching the
|
||||
return value definition". The `ApiMethod` entity carries a `ReturnDefinition` column
|
||||
to hold exactly this. However, nothing in the module ever reads `ReturnDefinition`:
|
||||
`ExecuteAsync` takes whatever object the script happens to return and does a blind
|
||||
`JsonSerializer.Serialize(result)`. There is no validation that the script's return
|
||||
value matches the declared shape, no coercion to the declared field types, and no
|
||||
error when a method returns a structure inconsistent with its definition. A method
|
||||
whose script returns the wrong shape (or `null` where a structure is required) will
|
||||
silently emit a malformed 200 response, and the documented return-definition contract
|
||||
is effectively unenforced. This is the response-side mirror of the parameter
|
||||
validation that `ParameterValidator` does perform, leaving the two halves of the
|
||||
method contract asymmetric.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either (a) implement return-value validation/shaping: parse `ReturnDefinition` with
|
||||
the same extended-type machinery used for parameters and validate/coerce the script
|
||||
result before serializing, returning a 500 (or logging) when the script result does
|
||||
not match; or (b) if return shaping is deliberately out of scope, remove the "Return
|
||||
Value Definition" / "fields matching the return value definition" language from
|
||||
`Component-InboundAPI.md` and document that the response is the script's raw return
|
||||
value serialized as-is. Code and design doc must be reconciled.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### InboundAPI-015 — `ForbiddenApiChecker` is purely textual and is bypassable via reflection reachable without a forbidden namespace token
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Security |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.InboundAPI/ForbiddenApiChecker.cs:63-119`, `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:109-126` |
|
||||
|
||||
**Description**
|
||||
|
||||
`ForbiddenApiChecker` walks the script syntax tree and rejects any `using` directive,
|
||||
`QualifiedNameSyntax`, or `MemberAccessExpressionSyntax` whose textual dotted name
|
||||
starts with a forbidden namespace prefix (`System.IO`, `System.Diagnostics`,
|
||||
`System.Reflection`, `System.Net`, etc.). This is a textual match, not a semantic
|
||||
one, and the trust model it enforces (per InboundAPI-005) is explicitly meant to keep
|
||||
*untrusted* Design-role scripts away from host APIs. The check can be bypassed because
|
||||
forbidden functionality is reachable through member access that never spells a
|
||||
forbidden namespace:
|
||||
|
||||
- `typeof(string).Assembly.GetType("System.IO.File")` — `typeof(string)` is permitted,
|
||||
`.Assembly` is a `System.Type` property, `.GetType(string)` is a `System.Reflection.Assembly`
|
||||
method. The string literal `"System.IO.File"` is a string, not a `QualifiedNameSyntax`
|
||||
or `MemberAccessExpressionSyntax`, so `IsForbidden` never sees it. The script obtains
|
||||
a `System.IO.File` `Type` and can `InvokeMember`/`GetMethod(...).Invoke(...)` on it —
|
||||
all via members of permitted types — with no forbidden namespace ever appearing in
|
||||
the source. `CompileAndRegister` references `typeof(object).Assembly`
|
||||
(System.Private.CoreLib) in `ScriptOptions`, so every framework type is loadable at
|
||||
runtime.
|
||||
- The executor also references the `Microsoft.CSharp.RuntimeBinder` assembly
|
||||
(`InboundScriptExecutor.cs:116`), enabling the `dynamic` keyword, which further
|
||||
widens late-bound member access that the static walker cannot see through.
|
||||
|
||||
Because the inbound API script runs on the central node with the host process's
|
||||
privileges and is authored by the (less-trusted-than-Admin) Design role, a static
|
||||
textual deny-list gives a false sense of containment.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Treat the syntax walker as defence-in-depth, not the boundary. Strengthen it where
|
||||
cheap (flag `Assembly.GetType`, `Type.GetType`, `Activator.CreateInstance`,
|
||||
`InvokeMember`, and `dynamic` usage), but for real enforcement run compiled scripts
|
||||
under a genuine boundary — a restricted `AssemblyLoadContext`/AppDomain-equivalent, a
|
||||
curated reference set that does not expose reflection-to-arbitrary-type, or an
|
||||
out-of-process sandbox — consistent with however the Site Runtime ultimately enforces
|
||||
its instance-script trust model. At minimum, document in `Component-InboundAPI.md`
|
||||
that the current check is best-effort and does not stop a determined script.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### InboundAPI-016 — Routed `Route.To().Call()` invocations are not bound by the method timeout
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.InboundAPI/RouteHelper.cs:59-152`, `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:177`, `:199` |
|
||||
|
||||
**Description**
|
||||
|
||||
`Component-InboundAPI.md` states the per-method timeout "defines the maximum time the
|
||||
method is allowed to execute (**including any routed calls to sites**)", and the
|
||||
Routing Behavior section says a routed call "blocks until the site responds or the
|
||||
**method-level timeout** is reached". The executor builds a linked
|
||||
`CancellationTokenSource` (`cts`) combining the request-abort token and a dedicated
|
||||
timeout CTS, and exposes `cts.Token` to the script as `InboundScriptContext.CancellationToken`.
|
||||
However, every `RouteTarget` method (`Call`, `GetAttribute(s)`, `SetAttribute(s)`)
|
||||
takes `CancellationToken cancellationToken = default` and the script must *explicitly*
|
||||
pass the context token for the routed call to honour the timeout. A natural script —
|
||||
`Route.To("inst").Call("doWork", parameters)` — invokes the routed call with
|
||||
`CancellationToken.None`. That request flows into `CommunicationService.RouteToCallAsync`
|
||||
with no cancellation, so the routed call is not bounded by the method timeout at all.
|
||||
The only timeout guard left is `handler(context).WaitAsync(cts.Token)` in
|
||||
`ExecuteAsync`: when the method timeout fires, `WaitAsync` returns a cancellation to
|
||||
the caller, but the underlying script `Task` — and the in-flight `RouteToCallAsync`
|
||||
awaiting a remote site — keeps running orphaned with no cancellation, holding the
|
||||
correlation/communication resources until the site eventually responds or its own
|
||||
transport timeout (if any) fires. The design's guarantee that the method timeout
|
||||
covers routed calls is therefore not met, and a slow/hung site can leak background
|
||||
work past the timeout the caller was told bounds the request.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Make routed calls inherit the method deadline without relying on script discipline:
|
||||
have `RouteHelper`/`RouteTarget` carry the executing method's `CancellationToken`
|
||||
(injected by `InboundScriptExecutor` when it constructs the context, e.g. a
|
||||
`RouteHelper` bound to `cts.Token`) and pass it into every `CommunicationService`
|
||||
call by default, so `Route.To("x").Call("s", p)` is timeout-bounded with no token
|
||||
argument. Keep the explicit-token overload for callers that want a tighter bound.
|
||||
Verify `RouteToCallAsync` and the attribute-routing calls actually observe the token
|
||||
and abandon the in-flight request when it fires.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### InboundAPI-017 — `RouteHelper` / `RouteTarget` has no test coverage
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Testing coverage |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.InboundAPI/RouteHelper.cs:1-165`, `tests/ScadaLink.InboundAPI.Tests/` |
|
||||
|
||||
**Description**
|
||||
|
||||
`RouteHelper`/`RouteTarget` is the entire WP-4 cross-site routing surface — the
|
||||
`Route.To().Call()/GetAttribute(s)/SetAttribute(s)` API that inbound API scripts use
|
||||
to reach instances at any site. It has zero tests: the `ScadaLink.InboundAPI.Tests`
|
||||
project covers `ApiKeyValidator`, `ParameterValidator`, `InboundScriptExecutor`, and
|
||||
`InboundApiEndpointFilter`, but no test file exercises `RouteHelper`. Untested
|
||||
behaviours include site resolution via `IInstanceLocator` (including the
|
||||
"instance not found / no assigned site" `InvalidOperationException` path at
|
||||
`RouteHelper.cs:154-164`), the `!response.Success` → `InvalidOperationException`
|
||||
translation in each routed method, `GetAttribute` delegating to the batch
|
||||
`GetAttributes` and returning `null` for an absent key, correlation-ID generation,
|
||||
and `SetAttribute` delegating to `SetAttributes`. These are non-trivial branches
|
||||
whose failure modes (a thrown exception inside a script) surface to the caller as a
|
||||
500, so regressions would be silent.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add a `RouteHelperTests` suite using substituted `IInstanceLocator` and
|
||||
`CommunicationService` (the executor tests already substitute `CommunicationService`):
|
||||
cover the happy path of each routed method, the unresolved-instance throw, the
|
||||
`!Success` → `InvalidOperationException` mapping, and `GetAttribute` returning `null`
|
||||
for a missing key. This also gives InboundAPI-016 a regression home if the timeout
|
||||
wiring is added.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.ManagementService` |
|
||||
| Design doc | `docs/requirements/Component-ManagementService.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 (1 Deferred — see ManagementService-012) |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 4 (1 Deferred — see ManagementService-012) |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -28,6 +28,24 @@ instances never disposed) and dead/unused configuration. None of the findings ar
|
||||
crash-class, but the site-scope gaps are High severity because they are a real
|
||||
authorization bypass with no workaround.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
All thirteen prior findings remain correctly closed; the source under
|
||||
`src/ScadaLink.ManagementService` is byte-identical between the previously reviewed state
|
||||
and `39d737e` (the resolution commits of findings 001–013 are folded into the history at or
|
||||
before `39d737e`). ManagementService-012 was re-checked and its **Deferred** status still
|
||||
holds: `ManagementEnvelope.Command` is still typed `object`, and the marker-interface fix
|
||||
still belongs in the Commons module, outside this module's edit scope — nothing has changed
|
||||
to make it actionable here. This re-review re-ran the full 10-category checklist against the
|
||||
current sources and surfaced **four new findings**. The dominant theme is the same
|
||||
site-scope authorization gap that findings 001/002 closed: `HandleQueryDeployments`
|
||||
(`QueryDeploymentsCommand`) was overlooked by that sweep and still performs no site-scope
|
||||
enforcement, letting a site-scoped Deployment user read deployment history for any site
|
||||
(014, High). The remaining three are lower severity: a non-atomic multi-override mutation
|
||||
that can leave an instance partially modified after an error (015, Medium), raw exception
|
||||
messages from unexpected faults being returned verbatim to HTTP callers (016, Low), and
|
||||
`QueryDeploymentsCommand` having no test coverage at all (017, Low).
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
@@ -551,3 +569,144 @@ infrastructure, so the testable command-parsing/dispatch logic was extracted int
|
||||
`ParseCommand` helper and covered instead. Tests: `ParseCommand_*` (5),
|
||||
`SerializeResult_*` (2), `UnknownCommandType_FaultMappedToManagementError`, plus the
|
||||
pre-existing site-scope and DebugStreamHub suites. `dotnet test` -> 48 passed.
|
||||
|
||||
### ManagementService-014 — HandleQueryDeployments bypasses site-scope enforcement
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Security |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:306`, `:1174` |
|
||||
|
||||
**Description**
|
||||
|
||||
`QueryDeploymentsCommand` is gated to the `Deployment` role by `GetRequiredRole`
|
||||
(`ManagementActor.cs:170`–`:177`), and the design document's Authorization section states
|
||||
"Site scoping is enforced for site-scoped Deployment users" and explicitly lists deployments
|
||||
among the Deployment-role operations. `HandleQueryDeployments` makes no call to
|
||||
`EnforceSiteScope` / `EnforceSiteScopeForInstance` / `EnforceSiteScopeForIdentifier`: with no
|
||||
`InstanceId` it returns `IDeploymentManagerRepository.GetAllDeploymentRecordsAsync()` (every
|
||||
deployment record across all sites), and with an `InstanceId` it returns that instance's
|
||||
deployment history with no check that the instance's site is within the caller's permitted
|
||||
set. A site-scoped Deployment user scoped to site A can therefore enumerate deployment
|
||||
records for instances at site B — instance IDs, `DeployedBy` (operator usernames),
|
||||
timestamps, deployment status, and `ErrorMessage` content — by issuing `QueryDeployments`
|
||||
with or without an out-of-scope `InstanceId`. This is the same authorization-bypass class as
|
||||
the resolved findings 001/002, on a handler that sweep did not cover; it is `DispatchCommand`'s
|
||||
only `Deployment`-role handler with no scope enforcement.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Thread `AuthenticatedUser` into `HandleQueryDeployments` (the dispatch case at line 306
|
||||
already has `user` in scope). When `cmd.InstanceId` is supplied, call
|
||||
`EnforceSiteScopeForInstance` before querying. When it is not supplied, filter the returned
|
||||
`DeploymentRecord` list to the caller's permitted sites — resolve each record's instance to
|
||||
its `SiteId` (or join through a site-aware repository query) and drop records for sites
|
||||
outside `PermittedSiteIds`, mirroring the `HandleListInstances` / `HandleListSites` filter
|
||||
pattern. Add a regression test for a site-scoped user against in-scope and out-of-scope
|
||||
instances.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### ManagementService-015 — HandleSetInstanceOverrides applies overrides non-atomically
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:647`–`:659` |
|
||||
|
||||
**Description**
|
||||
|
||||
`HandleSetInstanceOverrides` iterates `cmd.Overrides` and calls
|
||||
`InstanceService.SetAttributeOverrideAsync` once per attribute, throwing
|
||||
`InvalidOperationException` on the first `result.IsSuccess == false`. Each
|
||||
`SetAttributeOverrideAsync` call persists independently, so if the command supplies five
|
||||
overrides and the third fails (e.g. an unknown attribute name or a validation error), the
|
||||
first two overrides are already committed to the configuration database while the caller
|
||||
receives a `ManagementError`. The instance is left partially mutated in a state the operator
|
||||
neither sees nor requested, and the per-instance operation lock referenced in the design's
|
||||
deployment decisions does not protect against this because the partial writes are committed
|
||||
before the throw. A retry of the same command then re-applies the already-applied overrides.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Make the multi-override mutation all-or-nothing: either validate every requested override up
|
||||
front before applying any, or apply all overrides within a single transaction / unit-of-work
|
||||
so a mid-batch failure rolls back the earlier writes. If `InstanceService` cannot offer a
|
||||
batch method, at minimum document the partial-application behaviour on `SetInstanceOverridesCommand`
|
||||
and have the handler report which overrides were applied before the failure so the caller
|
||||
can reconcile.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### ManagementService-016 — Unexpected exception messages returned verbatim to HTTP callers
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Security |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:121`–`:131` |
|
||||
|
||||
**Description**
|
||||
|
||||
`MapFault` maps any non-`SiteScopeViolationException` fault to
|
||||
`new ManagementError(correlationId, cause.Message, "COMMAND_FAILED")`, and
|
||||
`ManagementEndpoints.HandleRequest` returns that `Error` string directly in the HTTP 400
|
||||
body. For handler-thrown `InvalidOperationException`s carrying a curated `result.Error`
|
||||
message this is intended and safe. But the same path also surfaces the raw `.Message` of
|
||||
unanticipated exceptions — a `SqlException` (which can include server/database names and
|
||||
constraint details), a `DbUpdateException`, an `ArgumentException` from `Enum.Parse` on a
|
||||
malformed `DataType`/`TriggerType` value, or a `NullReferenceException` — straight to the
|
||||
external CLI/HTTP client. This is a minor internal-detail disclosure surface: the exception
|
||||
text is already logged server-side with full context, so the client does not need the raw
|
||||
message.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Distinguish handler-curated failures from unexpected faults. Have handlers throw a dedicated
|
||||
exception type (e.g. `ManagementCommandException`) for messages that are safe to surface, and
|
||||
in `MapFault` return that message for the known type while returning a generic
|
||||
"An internal error occurred (CorrelationId=...)" string for everything else — the operator
|
||||
can still correlate to the server log via the correlation ID.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### ManagementService-017 — QueryDeploymentsCommand has no test coverage
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Testing coverage |
|
||||
| Status | Open |
|
||||
| Location | `tests/ScadaLink.ManagementService.Tests/ManagementActorTests.cs:1` |
|
||||
|
||||
**Description**
|
||||
|
||||
`QueryDeploymentsCommand` / `HandleQueryDeployments` is exercised by no test in
|
||||
`ManagementActorTests`. There is no test that it requires the `Deployment` role, no test of
|
||||
the `InstanceId`-filtered versus unfiltered branches, and — because the handler performs no
|
||||
site-scope enforcement at all — no test that would have caught finding 014. The deployment
|
||||
query is one of the operations the design's Authorization section calls out for site
|
||||
scoping, yet it is the only `Deployment`-role command with neither an authorization test nor
|
||||
a site-scope test.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add tests for `QueryDeploymentsCommand`: a role test (Design/no-role caller →
|
||||
`ManagementUnauthorized`), branch coverage for the `InstanceId`-filtered and unfiltered
|
||||
repository calls, and — once finding 014 is fixed — site-scope tests for a site-scoped
|
||||
Deployment user against in-scope and out-of-scope deployment records.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.NotificationService` |
|
||||
| Design doc | `docs/requirements/Component-NotificationService.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 5 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -30,6 +30,31 @@ plaintext `string` values. Test coverage exercises the happy path and the main
|
||||
error branches but misses the OAuth2 delivery path, the permanent-classification
|
||||
fallback in `DeliverAsync`, and concurrency on the token cache.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
Re-reviewed at commit `39d737e` after the NS-001..013 fixes. All twelve prior
|
||||
findings remain closed (eleven Resolved, NS-013 Deferred) and the fixes hold up
|
||||
against the current source — error classification is now typed, the SMTP client is
|
||||
created once and always disconnected/disposed, the S&F `Notification` delivery
|
||||
handler is registered in `AkkaHostedService`, and the OAuth2 token cache is keyed
|
||||
per credential. The re-review surfaced **five new findings**. The recurring theme is
|
||||
**unclassified-exception handling**: the typed `ClassifySmtpError` helper introduced
|
||||
by NS-002/003 returns `Unknown` for anything that is not a recognised SMTP/socket
|
||||
failure, but neither `SendAsync` nor `DeliverBufferedAsync` has a catch-all for the
|
||||
`Unknown` bucket. As a result an OAuth2 token-fetch failure (`HttpRequestException`
|
||||
from `EnsureSuccessStatusCode`, or `InvalidOperationException` from a malformed
|
||||
credential string) escapes `SendAsync` as a raw exception to the calling script
|
||||
(NS-015) and escapes `DeliverBufferedAsync` out of the S&F handler, where the engine
|
||||
treats any thrown exception as transient and retries a permanently-broken config
|
||||
forever (NS-014) — the same defect class NS-008 fixed for address parsing, but the
|
||||
OAuth2 path was never covered. Secondary findings: `AuthenticateAsync` silently
|
||||
proceeds unauthenticated for an unknown `authType` or empty credentials (NS-016);
|
||||
`NotificationOptions` is bound from configuration in two places but never read by
|
||||
any code (NS-017, dead config — NS-007 sourced the timeout/limit from
|
||||
`SmtpConfiguration` instead); and the lazily-created concurrency limiter is read
|
||||
outside its lock, is sized once and never resized on redeployment, and is never
|
||||
disposed (NS-018).
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
@@ -465,3 +490,108 @@ filed) and the genuinely-missing tests were added:
|
||||
`AkkaHostedService` handler registration), so no further test is required here.
|
||||
|
||||
Module test suite is green at 56 tests.
|
||||
|
||||
### NotificationService-014 — OAuth2 token-fetch failure escapes `DeliverBufferedAsync`; a permanently-broken config is retried forever
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:214-228`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:308-312`, `src/ScadaLink.NotificationService/OAuth2TokenService.cs:56-84` |
|
||||
|
||||
**Description**
|
||||
|
||||
`DeliverBufferedAsync` is the registered Store-and-Forward delivery handler for `StoreAndForwardCategory.Notification`. Per the S&F handler contract (`StoreAndForwardService.cs:40-53`), a handler must return `true` for success, `false` for permanent failure (park, no further retries), and **throw only for a transient failure** (retry). `DeliverBufferedAsync`'s `try` block (`:214-228`) catches only `SmtpPermanentException`; the comment at `:227` asserts "Transient SMTP errors propagate out of `DeliverAsync`". But `DeliverAsync` can also throw exceptions that `ClassifySmtpError` buckets as `Unknown` — most importantly the OAuth2 path at `:308-312`, where `OAuth2TokenService.GetTokenAsync` throws `HttpRequestException` (from `EnsureSuccessStatusCode` at `OAuth2TokenService.cs:78`) or `InvalidOperationException` (malformed credential triple, or missing `access_token`). None of these is an `SmtpCommandException`/`SmtpProtocolException`/`SocketException`/`TimeoutException`, so they are neither caught nor classified — they propagate straight out of `DeliverBufferedAsync`. The S&F engine treats *any* thrown exception as a transient failure, so a notification whose SMTP config has an invalid client secret (a 401 → `HttpRequestException`, a genuinely **permanent** condition) is retried on every sweep until `MaxRetries` is reached — burning OAuth2 token-endpoint calls and never parking promptly. A malformed credential string (`InvalidOperationException`, never fixable) is likewise retried instead of parked immediately.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add a catch-all to `DeliverBufferedAsync` for exceptions that `ClassifySmtpError` classifies as `Unknown`: log and return `false` (park) for non-retryable causes such as `InvalidOperationException` from credential parsing, and decide deliberately whether an OAuth2 `HttpRequestException` should park or retry (a 5xx token endpoint is transient; a 401 is permanent — inspect `HttpRequestException.StatusCode`). Do not let an unclassified exception leave the handler and be silently reinterpreted as transient.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### NotificationService-015 — Unclassified exceptions (OAuth2 token fetch, non-cancellation OCE) escape `SendAsync` to the calling script
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:96-148`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:308-312` |
|
||||
|
||||
**Description**
|
||||
|
||||
`SendAsync`'s `try`/`catch` around `DeliverAsync` has exactly three catch clauses: `SmtpPermanentException` (`:101`), `OperationCanceledException` when cancellation was requested (`:111`), and `Exception` filtered by `IsTransientSmtpError` (`:116`). There is no catch-all for an exception that is none of those. `ClassifySmtpError` returns `SmtpErrorClass.Unknown` for anything it does not recognise, and `IsTransientSmtpError` only matches `Transient`, so an `Unknown`-classified exception falls through all three clauses and escapes `SendAsync` unhandled. Concretely, the OAuth2 branch at `:308-312` calls `OAuth2TokenService.GetTokenAsync`, which throws `HttpRequestException` (token endpoint unreachable / 4xx / 5xx) or `InvalidOperationException` (malformed `tenant:client:secret` triple, or no `access_token` in the response). Both reach the calling script (an instance/alarm/shared script) as a raw exception of a type the `INotificationDeliveryService` contract never advertises — the design doc says delivery returns either success, a buffered transient result, or a permanent `NotificationResult` error. `Notify.To().Send()` against an OAuth2 config with a bad secret therefore crashes the script instead of returning a clean `NotificationResult(false, ...)`. This is the same defect class NS-008 resolved for malformed addresses, but NS-008's `ValidateAddresses` fix never covered the OAuth2 token path. (A non-cancellation `OperationCanceledException` — e.g. an internal MailKit timeout surfaced as OCE while the caller's token is *not* cancelled — would escape the same way.)
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add a final `catch (Exception ex)` to `SendAsync` that converts any otherwise-unhandled exception into a permanent `NotificationResult(false, ...)` (credential-scrubbed, consistent with NS-009), and log it. Treating an unclassified failure as permanent is the safe default — it returns control to the script rather than crashing it; an OAuth2 transient (token endpoint 5xx) being reported as permanent is an acceptable trade against an unhandled crash, and can be refined by inspecting `HttpRequestException.StatusCode` if transient buffering of token failures is wanted.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### NotificationService-016 — `AuthenticateAsync` silently sends unauthenticated for an unknown auth type or empty credentials
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Security |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:46-67` |
|
||||
|
||||
**Description**
|
||||
|
||||
`MailKitSmtpClientWrapper.AuthenticateAsync` returns without authenticating in two silent paths. (1) `if (string.IsNullOrEmpty(credentials)) return;` (`:48-49`) — if an SMTP config has `AuthType` "basic" or "oauth2" but a null/empty `Credentials` value (a misconfigured or partially-deployed row), the method returns and `DeliverAsync` proceeds straight to `SendAsync` on an *unauthenticated* connection. (2) The `switch (authType.ToLowerInvariant())` (`:51-66`) has cases only for `"basic"` and `"oauth2"` and **no `default`** — any other `AuthType` value (a typo, a future "ntlm", an empty string) falls through with no authentication attempted. Additionally, in the `"basic"` case, `credentials.Split(':', 2)` that yields fewer than 2 parts (a credential string with no colon) is silently skipped (`:55-58`). In every case the connection then attempts to send mail unauthenticated: against a relay that does not require auth this masks a misconfiguration; against a server that does require auth it fails later with a less obvious error, and at worst an unauthenticated send succeeds where it should not. Authentication being skipped should never be silent.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Make missing/unparseable credentials and an unrecognised `AuthType` hard errors: throw a typed exception (e.g. a permanent configuration exception so `SendAsync`/`DeliverBufferedAsync` surface it as a clean failure / park) rather than returning. Add a `default:` arm to the `switch` that throws `ArgumentException` naming the unsupported auth type, and reject a "basic" credential string that does not split into exactly two parts.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### NotificationService-017 — `NotificationOptions` is bound from configuration but never read (dead config)
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Code organization & conventions |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.NotificationService/NotificationOptions.cs:1-15`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:10-11`, `src/ScadaLink.Host/SiteServiceRegistration.cs:70` |
|
||||
|
||||
**Description**
|
||||
|
||||
`NotificationOptions` (with `ConnectionTimeoutSeconds` and `MaxConcurrentConnections`) is bound from the `ScadaLink:Notification` configuration section in two places — `ServiceCollectionExtensions.AddNotificationService` (`AddOptions<NotificationOptions>().BindConfiguration(...)`) and again in `Host/SiteServiceRegistration.cs:70` (`services.Configure<NotificationOptions>(...)`). However, a repo-wide search shows no code ever injects `IOptions<NotificationOptions>` or otherwise reads either property. When NS-007 enforced the connection timeout and concurrency limit, it sourced both values from the per-site `SmtpConfiguration` entity, not from `NotificationOptions` — so this options class is now entirely dead configuration. Its XML doc still claims it "provides fallback defaults and operational limits", which is misleading: nothing falls back to it. The double binding is also redundant. Dead, falsely-documented configuration invites an operator to set `ScadaLink:Notification:ConnectionTimeoutSeconds` and expect it to take effect, when it has no effect at all.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either delete `NotificationOptions` and both of its registrations, or genuinely wire it in as the documented fallback (e.g. `DeliverAsync` uses `NotificationOptions` values when the `SmtpConfiguration` field is non-positive). If kept, remove the duplicate `Configure` call and correct the XML doc. The NS-007 resolution note already states the `NotificationOptions` fields "remain as operational fallback defaults" — that intent should be implemented or the class removed.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### NotificationService-018 — Concurrency limiter: lock-free read of a non-volatile field, never resized on redeployment, never disposed
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:237-255` |
|
||||
|
||||
**Description**
|
||||
|
||||
Three minor issues with the NS-007 concurrency limiter. (1) `GetConcurrencyLimiter` reads the plain (non-`volatile`) field `_concurrencyLimiter` at `:242` *outside* `_limiterLock` as a fast path. The field is only written inside the lock (`:252`), so under the .NET memory model a concurrent thread is not guaranteed to observe the fully-constructed `SemaphoreSlim` via the lock-free read. In practice on x86/x64 this is benign and the double-checked write inside the lock is correct, but the unsynchronized read of a reference set under a lock is the classic lazy-init memory-model pitfall — use `Volatile.Read`/`LazyInitializer`/`Lazy<T>`. (2) The limiter is sized once from the *first* `SmtpConfiguration` ever seen and the `??=` at `:252` means it is never re-created. The design doc says one SMTP config is deployed per site, but that config can be **redeployed** with a different `MaxConcurrentConnections`; the new value is silently ignored for the lifetime of the (scoped, but potentially long-lived per request) service — and more importantly the limit is captured per `NotificationDeliveryService` instance, so behaviour depends on which scoped instance happens to create it first. (3) `SemaphoreSlim` implements `IDisposable`; `_concurrencyLimiter` is never disposed and `NotificationDeliveryService` does not implement `IDisposable`. A scoped service leaking one undisposed `SemaphoreSlim` per scope is a slow handle leak.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Replace the hand-rolled double-checked init with `Lazy<SemaphoreSlim>` or `LazyInitializer.EnsureInitialized`, or at minimum make the field access `Volatile.Read`/`Volatile.Write`. Consider hoisting the limiter to a singleton keyed by site (the limit is a per-site invariant, not a per-scope one) so redeployment with a changed limit and disposal are handled in one place; if it stays on the scoped service, implement `IDisposable` to dispose it.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
+98
-29
@@ -40,34 +40,34 @@ module file and counted in **Total**.
|
||||
| Severity | Open findings |
|
||||
|----------|---------------|
|
||||
| Critical | 0 |
|
||||
| High | 0 |
|
||||
| Medium | 0 |
|
||||
| Low | 0 |
|
||||
| **Total** | **0** |
|
||||
| High | 8 |
|
||||
| Medium | 26 |
|
||||
| Low | 32 |
|
||||
| **Total** | **66** |
|
||||
|
||||
## Module Status
|
||||
|
||||
| Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total |
|
||||
|--------|---------------|--------|----------------|------|-------|
|
||||
| [CLI](CLI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
|
||||
| [CentralUI](CentralUI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 19 |
|
||||
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 8 |
|
||||
| [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 12 |
|
||||
| [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
|
||||
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
|
||||
| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
|
||||
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
|
||||
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
|
||||
| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 12 |
|
||||
| [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
|
||||
| [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
|
||||
| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
|
||||
| [NotificationService](NotificationService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
|
||||
| [Security](Security/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
|
||||
| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
|
||||
| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 |
|
||||
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
|
||||
| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
|
||||
| [CLI](CLI/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/2 | 3 | 16 |
|
||||
| [CentralUI](CentralUI/findings.md) | 2026-05-16 | `9c60592` | 0/1/2/3 | 6 | 25 |
|
||||
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/1 | 2 | 10 |
|
||||
| [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/2 | 2 | 14 |
|
||||
| [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/2 | 4 | 15 |
|
||||
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/1 | 3 | 14 |
|
||||
| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-16 | `9c60592` | 0/1/2/1 | 4 | 17 |
|
||||
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 |
|
||||
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 |
|
||||
| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/3 | 4 | 16 |
|
||||
| [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/3 | 4 | 15 |
|
||||
| [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/3/1 | 4 | 17 |
|
||||
| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/2 | 4 | 17 |
|
||||
| [NotificationService](NotificationService/findings.md) | 2026-05-16 | `9c60592` | 0/2/1/2 | 5 | 18 |
|
||||
| [Security](Security/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/2 | 4 | 15 |
|
||||
| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/2 | 3 | 14 |
|
||||
| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/2 | 3 | 19 |
|
||||
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/1 | 3 | 17 |
|
||||
| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/0 | 2 | 16 |
|
||||
|
||||
## Pending Findings
|
||||
|
||||
@@ -80,14 +80,83 @@ description, location, recommendation — lives in the module's `findings.md`.
|
||||
|
||||
_None open._
|
||||
|
||||
### High (0)
|
||||
### High (8)
|
||||
|
||||
_None open._
|
||||
| ID | Module | Title |
|
||||
|----|--------|-------|
|
||||
| CentralUI-020 | [CentralUI](CentralUI/findings.md) | Idle-session redirect never fires: `SessionExpiry` polls a frozen auth-state snapshot |
|
||||
| Communication-012 | [Communication](Communication/findings.md) | gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover |
|
||||
| DataConnectionLayer-014 | [DataConnectionLayer](DataConnectionLayer/findings.md) | DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger |
|
||||
| DeploymentManager-015 | [DeploymentManager](DeploymentManager/findings.md) | Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates |
|
||||
| ExternalSystemGateway-015 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `MaxRetries == 0` is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim |
|
||||
| ManagementService-014 | [ManagementService](ManagementService/findings.md) | HandleQueryDeployments bypasses site-scope enforcement |
|
||||
| NotificationService-014 | [NotificationService](NotificationService/findings.md) | OAuth2 token-fetch failure escapes `DeliverBufferedAsync`; a permanently-broken config is retried forever |
|
||||
| NotificationService-015 | [NotificationService](NotificationService/findings.md) | Unclassified exceptions (OAuth2 token fetch, non-cancellation OCE) escape `SendAsync` to the calling script |
|
||||
|
||||
### Medium (0)
|
||||
### Medium (26)
|
||||
|
||||
_None open._
|
||||
| ID | Module | Title |
|
||||
|----|--------|-------|
|
||||
| CLI-014 | [CLI](CLI/findings.md) | `update` commands require "core" fields, making partial updates impossible |
|
||||
| CentralUI-021 | [CentralUI](CentralUI/findings.md) | `DebugView` stream callback mutates `Dictionary` off the render thread |
|
||||
| CentralUI-022 | [CentralUI](CentralUI/findings.md) | `Deployments` push handler fires `InvokeAsync` with no disposal guard |
|
||||
| ClusterInfrastructure-009 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | `DownIfAlone` is an inert configuration knob — never consumed by the HOCON builder |
|
||||
| Communication-013 | [Communication](Communication/findings.md) | Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller |
|
||||
| ConfigurationDatabase-012 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Inbound-API `ApiKey.KeyValue` bearer credential stored in plaintext |
|
||||
| ConfigurationDatabase-013 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Secret-column encryption silently falls back to an ephemeral (throwaway) key |
|
||||
| DataConnectionLayer-015 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Initial-connect failures never trigger failover; an unreachable primary at startup never tries the backup |
|
||||
| DataConnectionLayer-016 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleSubscribeCompleted` reports `SubscribeTagsResponse` success even on a connection-level subscribe failure |
|
||||
| DeploymentManager-016 | [DeploymentManager](DeploymentManager/findings.md) | Reconciled prior record keeps its stale `RevisionHash` |
|
||||
| ExternalSystemGateway-016 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `ConfigureHttpClientDefaults` applies the ESG connection cap to every `HttpClient` in the host process |
|
||||
| HealthMonitoring-015 | [HealthMonitoring](HealthMonitoring/findings.md) | Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt` |
|
||||
| Host-012 | [Host](Host/findings.md) | `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read |
|
||||
| InboundAPI-014 | [InboundAPI](InboundAPI/findings.md) | `ReturnDefinition` is loaded but never used; script return value is unshaped/unvalidated |
|
||||
| InboundAPI-015 | [InboundAPI](InboundAPI/findings.md) | `ForbiddenApiChecker` is purely textual and is bypassable via reflection reachable without a forbidden namespace token |
|
||||
| InboundAPI-016 | [InboundAPI](InboundAPI/findings.md) | Routed `Route.To().Call()` invocations are not bound by the method timeout |
|
||||
| ManagementService-015 | [ManagementService](ManagementService/findings.md) | HandleSetInstanceOverrides applies overrides non-atomically |
|
||||
| NotificationService-016 | [NotificationService](NotificationService/findings.md) | `AuthenticateAsync` silently sends unauthenticated for an unknown auth type or empty credentials |
|
||||
| Security-012 | [Security](Security/findings.md) | Partial LDAP failure during login yields a roleless authenticated session |
|
||||
| Security-014 | [Security](Security/findings.md) | `RefreshToken` re-issues a token without checking the idle timeout |
|
||||
| SiteEventLogging-012 | [SiteEventLogging](SiteEventLogging/findings.md) | Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted |
|
||||
| SiteRuntime-017 | [SiteRuntime](SiteRuntime/findings.md) | Instance Actor's live `_attributes` dictionary is shared by reference into child actor constructors |
|
||||
| StoreAndForward-015 | [StoreAndForward](StoreAndForward/findings.md) | `EnqueueAsync`'s public contract never documents that `maxRetries == 0` means "retry forever" |
|
||||
| StoreAndForward-016 | [StoreAndForward](StoreAndForward/findings.md) | Operator-initiated parked-message retry and discard are not replicated to the standby |
|
||||
| TemplateEngine-015 | [TemplateEngine](TemplateEngine/findings.md) | `RenameCompositionAsync` does not cascade-rename nested derived templates |
|
||||
| TemplateEngine-016 | [TemplateEngine](TemplateEngine/findings.md) | Composed-script `ScriptScope.ParentPath` is always empty, breaking `Parent.X` resolution for nested modules |
|
||||
|
||||
### Low (0)
|
||||
### Low (32)
|
||||
|
||||
_None open._
|
||||
| ID | Module | Title |
|
||||
|----|--------|-------|
|
||||
| CLI-015 | [CLI](CLI/findings.md) | `Component-CLI.md` command surface has drifted again in two places |
|
||||
| CLI-016 | [CLI](CLI/findings.md) | `WriteAsTable` derives columns from the first array element only |
|
||||
| CentralUI-023 | [CentralUI](CentralUI/findings.md) | Residual bare `catch {}` blocks swallow JS interop errors |
|
||||
| CentralUI-024 | [CentralUI](CentralUI/findings.md) | Claim lookups use magic strings instead of `JwtTokenService` constants |
|
||||
| CentralUI-025 | [CentralUI](CentralUI/findings.md) | `SessionExpiry` polling/redirect path has no test coverage |
|
||||
| ClusterInfrastructure-010 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Validator does not enforce `DownIfAlone = true` despite the design doc requiring it |
|
||||
| Commons-013 | [Commons](Commons/findings.md) | `DynamicJsonElement.TryGetIndex` rejects non-`int` index values |
|
||||
| Commons-014 | [Commons](Commons/findings.md) | `OpcUaEndpointConfigSerializer.Deserialize` can mislabel a corrupt typed row as `Legacy` |
|
||||
| Communication-014 | [Communication](Communication/findings.md) | Untrusted gRPC `correlation_id` flows directly into an Akka actor name |
|
||||
| Communication-015 | [Communication](Communication/findings.md) | No test exercises the real gRPC client factory across a node flip |
|
||||
| ConfigurationDatabase-014 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Redundant, inconsistent cast on one `HasConversion` call |
|
||||
| DataConnectionLayer-017 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `WriteBatchAsync` aborts the whole batch on a mid-batch disconnect |
|
||||
| DeploymentManager-017 | [DeploymentManager](DeploymentManager/findings.md) | `GetDeploymentStatusAsync` XML doc describes behaviour it does not implement |
|
||||
| ExternalSystemGateway-017 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `BuildUrl` appends a bare trailing `?` when a GET method's parameters are all null |
|
||||
| HealthMonitoring-013 | [HealthMonitoring](HealthMonitoring/findings.md) | Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout` |
|
||||
| HealthMonitoring-014 | [HealthMonitoring](HealthMonitoring/findings.md) | `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service |
|
||||
| HealthMonitoring-016 | [HealthMonitoring](HealthMonitoring/findings.md) | `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider` |
|
||||
| Host-013 | [Host](Host/findings.md) | `:F0` rounding of cluster timing values silently degrades sub-second configuration |
|
||||
| Host-014 | [Host](Host/findings.md) | Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8) |
|
||||
| Host-015 | [Host](Host/findings.md) | `StartupRetry` retries on every exception type, including permanent failures |
|
||||
| InboundAPI-017 | [InboundAPI](InboundAPI/findings.md) | `RouteHelper` / `RouteTarget` has no test coverage |
|
||||
| ManagementService-016 | [ManagementService](ManagementService/findings.md) | Unexpected exception messages returned verbatim to HTTP callers |
|
||||
| ManagementService-017 | [ManagementService](ManagementService/findings.md) | QueryDeploymentsCommand has no test coverage |
|
||||
| NotificationService-017 | [NotificationService](NotificationService/findings.md) | `NotificationOptions` is bound from configuration but never read (dead config) |
|
||||
| NotificationService-018 | [NotificationService](NotificationService/findings.md) | Concurrency limiter: lock-free read of a non-volatile field, never resized on redeployment, never disposed |
|
||||
| Security-013 | [Security](Security/findings.md) | `ExtractFirstRdnValue` mis-parses group DNs containing escaped commas |
|
||||
| Security-015 | [Security](Security/findings.md) | Username is not trimmed before use in the LDAP filter, fallback DN, and JWT claims |
|
||||
| SiteEventLogging-013 | [SiteEventLogging](SiteEventLogging/findings.md) | Keyword search does not escape SQL `LIKE` wildcards in user input |
|
||||
| SiteEventLogging-014 | [SiteEventLogging](SiteEventLogging/findings.md) | Initial purge runs synchronously on the host startup thread |
|
||||
| SiteRuntime-018 | [SiteRuntime](SiteRuntime/findings.md) | `ScriptExecutionActor` XML doc still claims a "dedicated blocking I/O dispatcher" |
|
||||
| SiteRuntime-019 | [SiteRuntime](SiteRuntime/findings.md) | Dead `DisableInstanceCommand` / `EnableInstanceCommand` handlers in `InstanceActor` |
|
||||
| StoreAndForward-017 | [StoreAndForward](StoreAndForward/findings.md) | Retry/Discard activity-log entries hard-code the `ExternalSystem` category |
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.Security` |
|
||||
| Design doc | `docs/requirements/Component-Security.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 (1 deferred — Security-008) |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 4 (1 deferred — Security-008) |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -28,6 +28,26 @@ on every refresh, weakening the documented 30-minute idle policy. None of these
|
||||
crash/data-loss bugs, but the TLS, cookie, and key-validation items are security
|
||||
defects that should be fixed before any production deployment.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
Re-reviewed the module after the batch-resolution of Security-001..007, 009, 010, 011.
|
||||
Those fixes were verified in place and remain Resolved: the `LdapTransport` enum makes
|
||||
StartTLS reachable, the auth cookie is `SecurePolicy.Always` with a sliding window, the
|
||||
JWT signing key is length-validated at construction, issuer/audience are bound and
|
||||
checked, the DN-injection fallback is RFC 4514-escaped, and the idle-timeout anchor is
|
||||
carried across refreshes. Security-008 (N+1 scope-rule query) remains correctly
|
||||
**Deferred** — `ISecurityRepository` still exposes no per-set scope-rule query, so the
|
||||
N+1 cannot be removed from within `RoleMapper.cs` alone; the deferral condition is
|
||||
unchanged. This pass surfaced **4 new findings** (Security-012..015): two Medium and
|
||||
two Low. The most notable is that a *partial* LDAP outage during login (bind succeeds,
|
||||
group search fails) silently returns an authenticated session with **zero roles**
|
||||
instead of failing the login as the design's LDAP-failure rule requires
|
||||
(Security-012); and that `RefreshToken` re-issues a token without ever checking
|
||||
`IsIdleTimedOut`, so an idle-expired session can be renewed indefinitely if the caller
|
||||
omits the separate idle check (Security-014). The two Low findings concern fragile
|
||||
DN parsing of group names containing escaped commas and an un-trimmed username flowing
|
||||
into the LDAP filter, fallback DN, and JWT claims.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
@@ -456,3 +476,141 @@ suite. Remaining gaps closed here: renamed the test file `UnitTest1.cs` →
|
||||
LDAP-timeout coverage (Security-009) plus extra no-service-account / DN-path edge cases
|
||||
(`BuildFallbackUserDn_NoSearchBase_ReturnsBareRdn`, `EscapeLdapDn_LeadingHash_IsEscaped`,
|
||||
`EscapeLdapDn_NullOrEmpty_ReturnedUnchanged`). Full module suite: 54 tests, all green.
|
||||
|
||||
### Security-012 — Partial LDAP failure during login yields a roleless authenticated session
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Security/LdapAuthService.cs:78-118` |
|
||||
|
||||
**Description**
|
||||
|
||||
In `AuthenticateAsync`, after the user bind succeeds, the group/attribute `Search` is
|
||||
wrapped in a `try`/`catch (LdapException)` that logs a warning and continues — the
|
||||
comment states "Auth succeeded even if attribute lookup failed". The method then returns
|
||||
`new LdapAuthResult(true, displayName, username, groups, null)` with an **empty
|
||||
`groups`** list. Because `RoleMapper.MapGroupsToRolesAsync` derives all roles from that
|
||||
group list, a login during a *partial* LDAP outage (bind reachable, search/group server
|
||||
unavailable, or a transient `LdapException` mid-search) produces a fully authenticated
|
||||
session with **zero roles** — the user is logged in but silently stripped of all
|
||||
permissions. The design's LDAP Connection Failure rule states new logins must **fail**
|
||||
when LDAP is unavailable; this path instead admits the user. The caller cannot
|
||||
distinguish "user genuinely belongs to no mapped groups" from "the group query failed",
|
||||
so it cannot make the documented fail-the-login decision. The `while` loop's inner
|
||||
`catch (LdapException) { break; }` (line 107-111) compounds this: a real error mid-enumeration
|
||||
is indistinguishable from end-of-results and silently truncates the group list.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Treat a failed group/attribute lookup on the *initial login* as an authentication
|
||||
failure: return `Success = false` with a "directory unavailable, try again" message, or
|
||||
add an explicit flag on `LdapAuthResult` (e.g. `GroupLookupSucceeded`) so the caller can
|
||||
apply the design's fail-the-login rule. The inner `while`-loop `catch` should not treat
|
||||
an arbitrary `LdapException` as a benign end-of-results sentinel — only the specific
|
||||
"no more results" condition should break the loop; any other exception should propagate
|
||||
or be surfaced.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Security-013 — `ExtractFirstRdnValue` mis-parses group DNs containing escaped commas
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Security/LdapAuthService.cs:258-269` |
|
||||
|
||||
**Description**
|
||||
|
||||
`ExtractFirstRdnValue` extracts a group name from a `memberOf` DN by taking the substring
|
||||
between the first `=` and the first `,`. RFC 4514 permits a `,` inside an RDN value when
|
||||
it is backslash-escaped — e.g. `cn=Acme\, Inc Operators,ou=groups,dc=example,dc=com`. The
|
||||
method splits on the *escaped* comma and returns `Acme\` as the group name. `RoleMapper`
|
||||
then matches that mangled name verbatim against configured `LdapGroupMapping.LdapGroupName`
|
||||
values, so a group whose CN legitimately contains a comma silently fails to map to its
|
||||
role — the user loses that role with no error. The same naive parsing also does not strip
|
||||
the trailing backslash escape.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Parse the first RDN with an escape-aware scan: ignore a `,` preceded by an unescaped
|
||||
`\`, and unescape RFC 4514 sequences in the extracted value. Alternatively use the LDAP
|
||||
library's DN-parsing API (`LdapDN` / RDN accessors) rather than hand-rolled `IndexOf`.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Security-014 — `RefreshToken` re-issues a token without checking the idle timeout
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Security |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Security/JwtTokenService.cs:156-169` |
|
||||
|
||||
**Description**
|
||||
|
||||
`RefreshToken` carries the existing `LastActivity` claim forward (correct, per the
|
||||
Security-007 fix) and unconditionally issues a fresh token with a new 15-minute expiry.
|
||||
It never calls `IsIdleTimedOut`. The documented 30-minute idle policy therefore depends
|
||||
entirely on the CentralUI request pipeline calling `IsIdleTimedOut` *before* calling
|
||||
`RefreshToken` — two independent, order-dependent checks with no enforcement that they
|
||||
are used together. If a caller refreshes without first checking idle state (an easy
|
||||
mistake, and there is no compiler or API signal preventing it), an idle-expired session
|
||||
is silently renewed: each refresh resets the JWT expiry while `LastActivity` keeps
|
||||
ageing, so the session can be kept alive indefinitely by background refreshes even
|
||||
though the user has been idle for hours. `RefreshToken` is the natural place to enforce
|
||||
the invariant but provides no safety net.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Have `RefreshToken` itself reject a principal that is already past the idle timeout —
|
||||
return `null` (the same "cannot refresh" signal it already uses for missing claims) when
|
||||
`IsIdleTimedOut(currentPrincipal)` is true — so the idle policy holds regardless of
|
||||
caller discipline. Document that an idle-expired principal cannot be refreshed and add a
|
||||
regression test covering refresh of a 31-minute-idle token.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Security-015 — Username is not trimmed before use in the LDAP filter, fallback DN, and JWT claims
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Security/LdapAuthService.cs:20-21`, `:80`, `:122`, `:169`, `:193` |
|
||||
|
||||
**Description**
|
||||
|
||||
`AuthenticateAsync` rejects null/whitespace-only usernames via `IsNullOrWhiteSpace`, but
|
||||
an otherwise-valid username with leading or trailing whitespace (`" alice"`, copy-paste
|
||||
artefacts, mobile keyboards) is passed through verbatim. It flows into the LDAP search
|
||||
filter `({attr}={EscapeLdapFilter(username)})`, into the fallback bind DN via
|
||||
`BuildFallbackUserDn`, and — most consequentially — into the returned `LdapAuthResult`
|
||||
and from there into the JWT `Username` claim. Most LDAP directories ignore surrounding
|
||||
whitespace on a search term, so `" alice"` and `"alice"` may both authenticate but yield
|
||||
JWT `Username` claims that differ by whitespace. Any downstream identity comparison
|
||||
(audit-log `who`, site-scope lookups keyed by username, session de-duplication) then
|
||||
sees two distinct identities for the same person.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Trim the username once at the top of `AuthenticateAsync` (after the `IsNullOrWhiteSpace`
|
||||
guard) and use the trimmed value consistently for the filter, the DN, and the
|
||||
`LdapAuthResult`. This normalises the identity that ends up in the JWT claim and audit
|
||||
trail.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.SiteEventLogging` |
|
||||
| Design doc | `docs/requirements/Component-SiteEventLogging.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 3 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -28,16 +28,33 @@ cluster-singleton placement of the handler actor (which can pin to the standby
|
||||
node), missing indexes for common query filters, retention/cap purge not enforcing
|
||||
the requirement strictly, and several documentation/maintainability issues.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
Re-reviewed the module at commit `39d737e`. All eleven prior findings remain closed
|
||||
(SiteEventLogging-001..003, 005..011 Resolved; 004 Won't Fix) and the resolutions
|
||||
hold up under inspection — the background writer, lock-guarded `WithConnection`,
|
||||
`auto_vacuum = INCREMENTAL` plus logical-size measurement, the severity index, and
|
||||
the concrete-recorder DI wiring are all present and correct at this commit. The
|
||||
module source is byte-identical between `39d737e` and current `HEAD`, so this review
|
||||
reflects the live code. Three new findings were recorded, all low-to-medium and none
|
||||
regressions of prior fixes. The most notable (SiteEventLogging-012) is a correctness
|
||||
gap left by the SiteEventLogging-005 background-writer rework: when an event cannot
|
||||
be persisted because the logger has been disposed, the returned `Task` is completed
|
||||
*successfully* rather than faulted, so an `await`-ing caller is told a dropped audit
|
||||
event was written. The other two are minor: unescaped SQL `LIKE` wildcards in the
|
||||
keyword-search filter (SiteEventLogging-013) and the initial purge running
|
||||
synchronously on the host startup thread (SiteEventLogging-014).
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). |
|
||||
| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). Re-review: dropped events report success (-012); `LIKE` wildcards unescaped in keyword search (-013). |
|
||||
| 2 | Akka.NET conventions | ☑ | Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). `Ask` boundary is appropriate. |
|
||||
| 3 | Concurrency & thread safety | ☑ | Shared `SqliteConnection` used by purge/query without the write lock (-003). |
|
||||
| 4 | Error handling & resilience | ☑ | `LogEventAsync` swallows write failures silently into a log line only (-008); purge catches broadly. |
|
||||
| 5 | Security | ☑ | Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding. |
|
||||
| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). |
|
||||
| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). Re-review: initial purge blocks host startup thread (-014). |
|
||||
| 7 | Design-document adherence | ☑ | Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002). |
|
||||
| 8 | Code organization & conventions | ☑ | Concrete-type downcast of `ISiteEventLogger` (-007); `internal Connection` leaks DB handle (-007). |
|
||||
| 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
|
||||
@@ -529,3 +546,122 @@ explanatory note added to `AddSiteEventLogging` pointing readers to where the ac
|
||||
is actually registered. Documentation/dead-code change only; no regression test was
|
||||
added — the change is a method removal verified by the compiler (no callers) and the
|
||||
full module suite still passing.
|
||||
|
||||
### SiteEventLogging-012 — Dropped events report success: `Task` is completed, not faulted, when the event cannot be persisted
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:160-166,193-197` |
|
||||
|
||||
**Description**
|
||||
|
||||
`LogEventAsync` returns a `Task` that, per the interface XML doc (corrected under
|
||||
SiteEventLogging-009), "completes once the event is durably persisted and faults if
|
||||
the write fails, so callers that `await` it observe success or failure." Two paths
|
||||
break that contract by signalling **success** for an event that was never written:
|
||||
|
||||
1. In `LogEventAsync`, if `_writeQueue.Writer.TryWrite(pending)` fails (the channel
|
||||
has been completed because the logger was disposed), the code calls
|
||||
`pending.Completion.TrySetResult()` — completing the `Task` successfully — even
|
||||
though the comment immediately above acknowledges "there is nowhere to persist the
|
||||
event."
|
||||
2. In `ProcessWriteQueueAsync`, `WithConnection` returns `false` when the logger has
|
||||
been disposed mid-drain. The code does not inspect the returned `written` flag and
|
||||
unconditionally calls `pending.Completion.TrySetResult()`, again reporting success
|
||||
for an event the comment admits "simply cannot be persisted."
|
||||
|
||||
The event log is the site's diagnostic audit trail. A caller that `await`s
|
||||
`LogEventAsync` to confirm a critical event (deployment applied, alarm activated) was
|
||||
recorded will observe a *successful* completion for an event that was silently
|
||||
dropped. This is the same class of defect SiteEventLogging-008 fixed for write
|
||||
*errors* — but the disposed-drop path was left reporting false success. The window
|
||||
is the disposal/shutdown interval, during which shutdown-related events (graceful
|
||||
singleton handover, instance disable) are exactly the events most likely to be
|
||||
enqueued and lost.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
For both paths, fault the `Task` (or complete it with a sentinel failure) instead of
|
||||
`TrySetResult()` — e.g. `pending.Completion.TrySetException(new ObjectDisposedException(...))`
|
||||
— so an `await`-ing caller can distinguish a dropped event from a persisted one.
|
||||
Inspect the `written` flag returned by `WithConnection` in `ProcessWriteQueueAsync`
|
||||
and only call `TrySetResult()` when `written` is `true`. Update the XML doc if a
|
||||
deliberate "drop silently on shutdown" semantics is chosen instead.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### SiteEventLogging-013 — Keyword search does not escape SQL `LIKE` wildcards in user input
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:79-83` |
|
||||
|
||||
**Description**
|
||||
|
||||
The keyword-search filter builds the `LIKE` pattern as `$"%{request.KeywordFilter}%"`
|
||||
and binds it as a parameter. Parameterisation correctly prevents SQL injection, but
|
||||
it does **not** neutralise the `LIKE` metacharacters `%` and `_` inside the
|
||||
user-supplied keyword. A search for a literal `_` (common in event sources and
|
||||
identifiers such as `store_and_forward`, `PLC_1`, or instance IDs) is interpreted as
|
||||
"match any single character", and a `%` matches any run of characters. The design
|
||||
calls keyword search "free-text search on message and source fields ... Useful for
|
||||
finding events by script name, alarm name, or error message" — users will reasonably
|
||||
expect a literal substring match, so a query for `store_and_forward` silently returns
|
||||
events containing `storeXandYforward` and similar false positives. There is no way
|
||||
for the caller to search for a literal underscore or percent.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Escape `%`, `_`, and the escape character itself in `request.KeywordFilter` before
|
||||
wrapping it in `%...%`, and append an `ESCAPE` clause to the `LIKE` expression
|
||||
(e.g. `... LIKE $keyword ESCAPE '\'`). Alternatively document that the keyword field
|
||||
accepts `LIKE` wildcard syntax, but a literal-substring match is the behaviour the
|
||||
design implies.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### SiteEventLogging-014 — Initial purge runs synchronously on the host startup thread
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Performance & resource management |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:34-48` |
|
||||
|
||||
**Description**
|
||||
|
||||
`EventLogPurgeService.ExecuteAsync` calls `RunPurge()` (a fully synchronous method
|
||||
that runs `PurgeByRetention` and `PurgeByStorageCap`) *before* the first `await`
|
||||
(`await timer.WaitForNextTickAsync(...)`). A `BackgroundService`'s `ExecuteAsync` is
|
||||
invoked from `StartAsync`, and the host's startup pipeline does not proceed past a
|
||||
`BackgroundService` until its `ExecuteAsync` yields at the first real `await`. Because
|
||||
`RunPurge()` precedes any `await`, the entire initial purge — including a cap-purge
|
||||
that deletes rows in 1000-row batches and runs `PRAGMA incremental_vacuum` until a
|
||||
near-1 GB database is back under the cap — executes inline on the startup thread,
|
||||
blocking host startup (and therefore the `/health/ready` gate) for as long as the
|
||||
purge takes. On a site that has accumulated a large log this can be a multi-second
|
||||
stall during every node start/failover. The class doc states the service "runs on a
|
||||
background thread and does not block event recording" — the startup-thread block is
|
||||
inconsistent with that intent.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Yield before the initial purge so it runs on the background scheduler rather than the
|
||||
startup thread — e.g. `await Task.Yield();` as the first statement of `ExecuteAsync`,
|
||||
or move the initial `RunPurge()` to after the first `await timer.WaitForNextTickAsync`
|
||||
(accepting a one-interval delay), or offload it with `await Task.Run(RunPurge, stoppingToken)`.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.SiteRuntime` |
|
||||
| Design doc | `docs/requirements/Component-SiteRuntime.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 3 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -28,6 +28,24 @@ in a comment but ships it anyway). Test coverage exists for the coordinator acto
|
||||
persistence and scripting, but the short-lived execution actors, the replication
|
||||
actor, and the repositories are untested.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
The module was re-reviewed at commit `39d737e`. No source under
|
||||
`src/ScadaLink.SiteRuntime` has changed since the previous review at `9c60592`
|
||||
(the only intervening commits are code-review documentation updates), so all of
|
||||
SiteRuntime-001..013, 015, 016 remain Resolved and SiteRuntime-014 remains
|
||||
Deferred — its Deferred justification (a trigger-evaluation concurrency design
|
||||
decision is required before either recommended fix can land in-module) still
|
||||
holds verbatim against the unchanged `ScriptActor`/`AlarmActor` source. The
|
||||
re-review nonetheless worked through all 10 checklist categories afresh and
|
||||
surfaced three new findings that the prior pass did not record: a cross-thread
|
||||
`Dictionary` enumeration race when the Instance Actor's live `_attributes`
|
||||
dictionary is handed by reference into child `ScriptActor`/`AlarmActor`
|
||||
constructors (SiteRuntime-017, Medium); a stale `ScriptExecutionActor` XML doc
|
||||
that still claims a "dedicated blocking I/O dispatcher" (SiteRuntime-018, Low);
|
||||
and two dead lifecycle handlers in `InstanceActor` that the Deployment Manager
|
||||
never routes to (SiteRuntime-019, Low). Open findings: 3.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
@@ -733,3 +751,126 @@ harness is a larger test-infrastructure task tracked separately and out of scope
|
||||
Low-severity coverage finding; the highest-value untested paths the finding called out
|
||||
(script timeout/failure/reply/self-stop) are now covered. Full module suite: 192 tests
|
||||
green.
|
||||
|
||||
### SiteRuntime-017 — Instance Actor's live `_attributes` dictionary is shared by reference into child actor constructors
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:625`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:675`, `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:83`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:93` |
|
||||
|
||||
**Description**
|
||||
|
||||
`InstanceActor.CreateChildActors` passes the Instance Actor's own mutable
|
||||
`_attributes` field (a plain `Dictionary<string, object?>`) by reference into the
|
||||
`Props.Create(...)` factory for every `ScriptActor` and `AlarmActor` (as the
|
||||
`initialAttributes` constructor argument). Each child constructor then iterates
|
||||
that dictionary to seed its `_attributeSnapshot`:
|
||||
|
||||
```csharp
|
||||
if (initialAttributes != null)
|
||||
foreach (var kvp in initialAttributes)
|
||||
_attributeSnapshot[kvp.Key] = kvp.Value;
|
||||
```
|
||||
|
||||
`Context.ActorOf` returns immediately; the child actor's constructor runs later on
|
||||
the *child's* mailbox thread. Meanwhile the Instance Actor's `PreStart` returns and
|
||||
the Instance Actor begins processing its mailbox — `HandleTagValueUpdate` and
|
||||
`HandleAttributeValueChanged` both mutate `_attributes` (`_attributes[...] = ...`).
|
||||
A DCL tag update that arrives before a child has finished its constructor copy
|
||||
therefore mutates the dictionary on the Instance Actor thread while the child
|
||||
thread is enumerating it. `Dictionary<,>` is explicitly not safe for concurrent
|
||||
read/write: the enumeration can throw `InvalidOperationException` ("collection was
|
||||
modified") — which surfaces as an `ActorInitializationException` and, under the
|
||||
Instance Actor's `SupervisorStrategy`, **stops** the child (the strategy returns
|
||||
`Stop` for `ActorInitializationException`). The script or alarm is then silently
|
||||
absent for the life of the instance. A torn read of an entry is also possible. The
|
||||
window is small but deterministically reachable on a busy site at startup/failover
|
||||
— exactly the staggered-startup scenario the design is most concerned about.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Do not share the live dictionary. Snapshot it on the Instance Actor thread before
|
||||
constructing the child — e.g. pass `new Dictionary<string, object?>(_attributes)`
|
||||
(or an immutable copy) into each `Props.Create`. The copy is made on the Instance
|
||||
Actor thread inside `CreateChildActors`, so it is race-free, and each child gets a
|
||||
private dictionary to seed from.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### SiteRuntime-018 — `ScriptExecutionActor` XML doc still claims a "dedicated blocking I/O dispatcher"
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Documentation & comments |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptExecutionActor.cs:17` |
|
||||
|
||||
**Description**
|
||||
|
||||
The class-level XML summary on `ScriptExecutionActor` states "Runs on a dedicated
|
||||
blocking I/O dispatcher." That is not what the code does. SiteRuntime-009 was
|
||||
resolved by introducing `ScriptExecutionScheduler` (a bounded dedicated
|
||||
`TaskScheduler`); the *actor itself and its mailbox* run on the **default** Akka
|
||||
dispatcher, and only the script body runs on the scheduler's threads via
|
||||
`Task.Factory.StartNew(..., scheduler)`. The resolution of SiteRuntime-009
|
||||
explicitly chose the `TaskScheduler` route *instead of* a HOCON dispatcher and
|
||||
even removed the "in production, configure a dedicated dispatcher" comments
|
||||
elsewhere — but this stale summary line was missed. A reader is told the actor is
|
||||
on a dedicated dispatcher when it is not, which is misleading when reasoning about
|
||||
mailbox throughput and thread-pool pressure. (`AlarmExecutionActor` does not carry
|
||||
the equivalent claim — its summary only says "Same pattern as ScriptExecutionActor.")
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Correct the summary to describe the actual model: the actor runs on the default
|
||||
dispatcher and the script body is dispatched onto the dedicated
|
||||
`ScriptExecutionScheduler` (SiteRuntime-009). Align the wording with the accurate
|
||||
comment already present at `ScriptExecutionActor.cs:71-73`.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### SiteRuntime-019 — Dead `DisableInstanceCommand` / `EnableInstanceCommand` handlers in `InstanceActor`
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:106`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:113` |
|
||||
|
||||
**Description**
|
||||
|
||||
`InstanceActor`'s constructor registers `Receive<DisableInstanceCommand>` and
|
||||
`Receive<EnableInstanceCommand>` handlers that log and reply with a successful
|
||||
`InstanceLifecycleResponse`. These handlers are unreachable. The Deployment Manager
|
||||
is the only sender of those commands, and `DeploymentManagerActor.HandleDisable` /
|
||||
`HandleEnable` handle the lifecycle entirely themselves — they call
|
||||
`Context.Stop(actor)` (disable) or `CreateInstanceActor(...)` (enable) directly and
|
||||
reply to the original sender from the Deployment Manager. Neither command is ever
|
||||
`Forward`-ed or `Tell`-ed to the Instance Actor. The handlers are dead code, and
|
||||
they are actively misleading: a maintainer reading `InstanceActor` would reasonably
|
||||
believe disable/enable is partly an Instance-Actor responsibility, and the no-op
|
||||
"true" reply implies an instance-side acknowledgement contract that does not exist.
|
||||
If a future change *did* route these commands here, the disable handler would do
|
||||
nothing useful (it does not stop children or tear down state — Akka does that when
|
||||
the parent stops the actor).
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Remove the two `Receive<...>` registrations and their handler bodies from
|
||||
`InstanceActor`, since the Deployment Manager owns the disable/enable lifecycle.
|
||||
If the intent is to keep them for a future instance-side hook, add an XML comment
|
||||
stating that the Deployment Manager currently handles these and the handlers are a
|
||||
reserved placeholder — but removal is preferred.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.StoreAndForward` |
|
||||
| Design doc | `docs/requirements/Component-StoreAndForward.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 (3 Deferred: 002, 011, 012 — see notes) |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 3 (3 Deferred: 002, 011, 012 — see notes) |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -30,20 +30,45 @@ status set, and untested critical paths (retry-due timing, replication-from-acti
|
||||
the actor bridge). None of the findings are blockers for compilation, but the
|
||||
replication and retry-count issues are functional defects against the design.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
Re-reviewed at commit `39d737e` after the batch-3 fixes. All of findings 001 and
|
||||
003–010, plus 013–014, are confirmed `Resolved` against the current source: the
|
||||
replication wiring (`BufferAsync`/`ReplicateRemove`/`ReplicatePark`), the corrected
|
||||
retry-count semantics, the conditional `UpdateMessageIfStatusAsync` writes, the
|
||||
transactioned parked-message reads, the `PipeTo` refactor, the `RaiseActivity`
|
||||
hardening, the `RetryParkedMessageAsync` `last_attempt_at` reset and the database
|
||||
directory creation are all present as described. Findings 002, 011 and 012 were
|
||||
re-verified and remain validly `Deferred` — their preconditions are unchanged (002's
|
||||
residual no-handler gap, 011's Commons-owned enum, 012's Commons-owned entity placement).
|
||||
|
||||
This pass surfaced **three new findings**. StoreAndForward-015 records the
|
||||
StoreAndForward side of the cross-module `MaxRetries == 0` ambiguity flagged by
|
||||
ExternalSystemGateway-015: `EnqueueAsync`'s public contract documents `maxRetries` only
|
||||
as "parked once `MaxRetries` is reached" and never states the `0 = no limit / retry
|
||||
forever` special case that `RetryMessageAsync` actually enforces, so an ESG caller
|
||||
passing `0` to mean "never retry" gets the opposite behaviour with no warning from the
|
||||
S&F API surface. StoreAndForward-016 records that operator-initiated parked-message
|
||||
retry and discard are not replicated to the standby — only the add/remove/park sweep
|
||||
paths are — so a failover diverges the standby buffer from the active one.
|
||||
StoreAndForward-017 records that the Retry/Discard activity-log entries hard-code the
|
||||
`ExternalSystem` category, mislabelling notification and cached-DB-write messages in
|
||||
the site event log.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010). |
|
||||
| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010); Retry/Discard activity log hard-codes the category (017). |
|
||||
| 2 | Akka.NET conventions | ☑ | `ContinueWith` used instead of `PipeTo`-friendly continuations; default supervision; see 007. |
|
||||
| 3 | Concurrency & thread safety | ☑ | Sweep guarded by `Interlocked`, but no guard against retry-vs-manage races (005); `OnActivity` event not thread-safe (009). |
|
||||
| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002). |
|
||||
| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002); operator retry/discard not replicated to standby (016). |
|
||||
| 5 | Security | ☑ | No issues found — parameterised SQL throughout; no secrets handled directly; payload JSON treated opaquely. |
|
||||
| 6 | Performance & resource management | ☑ | New SQLite connection per call; multi-statement operations not wrapped in a transaction (006, 008). |
|
||||
| 7 | Design-document adherence | ☑ | Replication gap (001); `InFlight` status undocumented/unused (011); "retrying" status from design doc not modelled. |
|
||||
| 8 | Code organization & conventions | ☑ | `StoreAndForwardMessage` is an entity-like POCO living in the component, not Commons (012). |
|
||||
| 9 | Testing coverage | ☑ | Retry-due timing, replication-from-active, and `ParkedMessageHandlerActor` are untested (013). |
|
||||
| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004). |
|
||||
| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004); `EnqueueAsync` never documents the `maxRetries == 0` = "retry forever" special case (015). |
|
||||
|
||||
## Findings
|
||||
|
||||
@@ -703,3 +728,158 @@ and bare filenames are skipped). Regression test
|
||||
`InitializeAsync_FileInMissingDirectory_CreatesDirectory` fails against the pre-fix code;
|
||||
all six `SiteActorPathTests` now pass. Fixed by the commit whose message references
|
||||
`StoreAndForward-014`.
|
||||
|
||||
### StoreAndForward-015 — `EnqueueAsync`'s public contract never documents that `maxRetries == 0` means "retry forever"
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Documentation & comments |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:114`–`:130`, `:285` |
|
||||
|
||||
**Description**
|
||||
|
||||
The re-review brief asks for the StoreAndForward side of the cross-module ambiguity
|
||||
recorded as `ExternalSystemGateway-015`. The semantics are split across this module and
|
||||
its callers, and the StoreAndForward side carries a genuine documentation/API-contract
|
||||
fault:
|
||||
|
||||
- `RetryMessageAsync` parks a message only when `message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries`
|
||||
(`StoreAndForwardService.cs:285`). When `MaxRetries == 0` the guard is false on every
|
||||
sweep, so a `0` value means **"no limit — retry forever"**. The
|
||||
`StoreAndForwardMessage.MaxRetries` XML doc (`StoreAndForwardMessage.cs:31`) does state
|
||||
`"0 = no limit"`, so the persistence model is internally consistent.
|
||||
- But `EnqueueAsync` — the *only* public entry point into the engine — exposes a
|
||||
`maxRetries` parameter (`StoreAndForwardService.cs:128`) with **no parameter
|
||||
documentation at all**, and its method summary (lines 114–122) describes the lifecycle
|
||||
only as "On max retries → park" / "parked once `MaxRetries` is reached" (see also the
|
||||
`_deliveryHandlers` field doc, line 50–51). Nothing on the public surface tells a
|
||||
caller that passing `0` flips the meaning from "park immediately / never retry" to
|
||||
"retry forever". A caller reading only `EnqueueAsync` would reasonably assume `0`
|
||||
retries means zero retries.
|
||||
- This is exactly the trap ESG fell into: `ExternalSystemClient.CachedCallAsync` /
|
||||
`DatabaseGateway.CachedWriteAsync` pass the source entity's `MaxRetries` verbatim,
|
||||
intending `0` to mean "never retry", and instead get unbounded retry — the
|
||||
duplicate-delivery / unbounded-buffer-growth hazard the design doc's idempotency note
|
||||
warns against. The fault is not solely ESG's: the S&F public API silently overloads
|
||||
`0` with the opposite of its natural reading and does not document it.
|
||||
|
||||
The defect is in this module's API contract and documentation, so it is recorded here
|
||||
in addition to `ExternalSystemGateway-015`. (Whether `0` *should* mean "no limit" or
|
||||
"no retry" is the cross-module design decision tracked by ESG-015; this finding is
|
||||
specifically that the StoreAndForward public surface fails to document whichever
|
||||
meaning is chosen.)
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Document the `maxRetries` parameter on `EnqueueAsync` explicitly with a `<param>` tag
|
||||
that states the `0` special case in the same words as `StoreAndForwardMessage.MaxRetries`
|
||||
(`"0 = no limit — retried on every sweep until delivered, never parked"`), and add the
|
||||
`0` case to the method summary's lifecycle description. Better still — and consistent
|
||||
with the resolution of ESG-015 — make the engine reject the ambiguity at the API: accept
|
||||
a nullable/`enum` retry policy, or treat `0` as an explicit "no retry" (do not buffer, or
|
||||
park on the first sweep) so the natural reading and the behaviour agree. Either way the
|
||||
public `EnqueueAsync` contract must state the chosen meaning; today it states nothing.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### StoreAndForward-016 — Operator-initiated parked-message retry and discard are not replicated to the standby
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:339`–`:362`; `src/ScadaLink.StoreAndForward/ReplicationService.cs:131`–`:136` |
|
||||
|
||||
**Description**
|
||||
|
||||
`StoreAndForward-001`'s fix wired replication into the active *delivery* paths:
|
||||
`BufferAsync` replicates an `Add`, a successful retry replicates a `Remove`, and a park
|
||||
replicates a `Park`. But the two *operator* paths — `RetryParkedMessageAsync` (line 339)
|
||||
and `DiscardParkedMessageAsync` (line 353) — change buffer state and never touch
|
||||
`_replication`:
|
||||
|
||||
- `RetryParkedMessageAsync` flips a row from `Parked` back to `Pending` (and clears
|
||||
`retry_count` / `last_attempt_at`) in the local SQLite only. The standby's copy stays
|
||||
`Parked`.
|
||||
- `DiscardParkedMessageAsync` `DELETE`s the row from the local SQLite only. The standby's
|
||||
copy is left in place, still `Parked`.
|
||||
|
||||
The Component design doc ("Persistence") requires the active node to forward "each
|
||||
buffer operation (add, remove, park)" so that on failover "the new active node has a
|
||||
near-complete copy of the buffer." An operator retrying a parked message is a buffer
|
||||
state change; an operator discarding one is a removal. After a failover that follows an
|
||||
operator action:
|
||||
|
||||
1. A **discarded** message reappears on the new active node — it is still `Parked`
|
||||
there, so it resurfaces in the central UI's parked-message list and an operator must
|
||||
discard it a second time. For a message deliberately removed (e.g. a known-bad
|
||||
payload) this is a correctness regression of the operator's intent.
|
||||
2. A **retried** message is still `Parked` on the new active node, so the operator's
|
||||
"move it back to the queue" action is silently lost across the failover and the
|
||||
message is not re-attempted.
|
||||
|
||||
`ReplicationOperationType` only models `Add`/`Remove`/`Park` — there is no operation for
|
||||
"un-park / move back to pending", so even a minimal fix needs either a new operation
|
||||
type or a re-use of `Add` to overwrite the standby row. This is the same class of defect
|
||||
as the now-resolved `StoreAndForward-001`, for the operator paths rather than the sweep
|
||||
paths.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Replicate both operator actions. `DiscardParkedMessageAsync` should call
|
||||
`_replication?.ReplicateRemove(messageId)` after a successful local delete (the existing
|
||||
`Remove` op already deletes on the standby). For `RetryParkedMessageAsync`, add a
|
||||
`Requeue`/`Unpark` `ReplicationOperationType` whose `ApplyReplicatedOperationAsync` case
|
||||
resets the standby row to `Pending` with `retry_count = 0`, or have the method re-load
|
||||
the updated message and replicate it as an `Add`-style upsert. Add replication tests for
|
||||
both operator paths (the existing `StoreAndForwardReplicationTests` only cover the sweep
|
||||
paths).
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### StoreAndForward-017 — Retry/Discard activity-log entries hard-code the `ExternalSystem` category
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:344`, `:358` |
|
||||
|
||||
**Description**
|
||||
|
||||
`RetryParkedMessageAsync` and `DiscardParkedMessageAsync` raise an S&F activity
|
||||
notification (consumed by Site Event Logging — WP-14) but pass a hard-coded
|
||||
`StoreAndForwardCategory.ExternalSystem` as the category argument:
|
||||
|
||||
```csharp
|
||||
RaiseActivity("Retry", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} moved back to queue");
|
||||
RaiseActivity("Discard", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} discarded");
|
||||
```
|
||||
|
||||
Both methods take only a `messageId` and never load the message, so they have no access
|
||||
to its real category. When an operator retries or discards a parked **Notification** or
|
||||
**CachedDbWrite** message, the site event log records the activity under
|
||||
`ExternalSystem`. Every other `RaiseActivity` call in the service passes the message's
|
||||
true `Category` (`EnqueueAsync`, `RetryMessageAsync`), so the operator paths are
|
||||
inconsistent and produce mislabelled audit entries — misleading when an operator later
|
||||
filters or reviews S&F activity by category.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Load the message (or have `StoreAndForwardStorage.RetryParkedMessageAsync` /
|
||||
`DiscardParkedMessageAsync` return the affected row's category) and pass the real
|
||||
`Category` to `RaiseActivity`. If loading the row is considered too costly on these
|
||||
infrequent operator paths, change the `OnActivity` event / `RaiseActivity` signature to
|
||||
allow a nullable category for management actions rather than asserting a false one.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.TemplateEngine` |
|
||||
| Design doc | `docs/requirements/Component-TemplateEngine.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 2 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -29,11 +29,30 @@ create, optimistic concurrency on instance state) are claimed but not implemente
|
||||
Themes: validation that is weaker than the design promises, and asymmetric handling
|
||||
of attributes vs. alarms vs. scripts throughout the resolve/flatten/derive paths.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
Re-reviewed the whole module against all ten checklist categories at commit
|
||||
`39d737e`. All fourteen prior findings remain closed — the batch-4 fixes
|
||||
(`bc88a36`/`804697f` and predecessors) hold up: the recursive composition walk,
|
||||
the per-slot alarm override mechanism, the code-region-aware delimiter scanner,
|
||||
and the single-source deletion-constraint logic are all correctly in place. Two
|
||||
new Medium findings surfaced, both in the composition-cascade path and both
|
||||
affecting **nested** (depth ≥ 2) compositions specifically — the same blind spot
|
||||
that produced TemplateEngine-001. **TemplateEngine-015**: `RenameCompositionAsync`
|
||||
renames only the directly slot-owned derived template, leaving cascaded inner
|
||||
derived templates with a stale dotted-path name. **TemplateEngine-016**:
|
||||
`FlatteningService` hard-codes `ScriptScope.ParentPath` to the empty string for
|
||||
every composed script regardless of nesting depth, so a script two or more
|
||||
levels deep cannot resolve `Parent.X` references to its real parent module.
|
||||
Both are limited-impact (nested compositions are the less common case and there
|
||||
is design-time visibility) but represent genuine drift from the recursive-nesting
|
||||
design promise.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ✓ | Multiple real bugs: deep composed-member loss, derived alarms omitted, granularity bypass, no-op create-time collision block. |
|
||||
| 1 | Correctness & logic bugs | ✓ | Prior bugs (001–005, 013) all resolved and verified. Re-review 2026-05-17 found two new nested-composition defects: rename does not cascade (TemplateEngine-015), composed-script `ParentPath` always empty (TemplateEngine-016). |
|
||||
| 2 | Akka.NET conventions | ✓ | No actors in this module (`AddTemplateEngineActors` is an empty placeholder). Nothing to assess. |
|
||||
| 3 | Concurrency & thread safety | ✓ | Services are stateless, scoped per request; static helpers hold no mutable state. Design says template editing is last-write-wins; that is honoured. See TemplateEngine-010 re: a doc claim of optimistic concurrency that is not implemented. |
|
||||
| 4 | Error handling & resilience | ✓ | `Result<T>` used consistently; repository nulls guarded. `FlatteningService` wraps in try/catch. No store-and-forward or failover surface in this module. |
|
||||
@@ -648,3 +667,102 @@ reports all blocking reasons and uses `TemplateDeletionService`'s phrasing — t
|
||||
affected `TemplateServiceTests` delete tests were updated to the unified messages,
|
||||
and a regression test `DeleteTemplate_MultipleConstraints_ReportsAllNotJustFirst`
|
||||
verifies all three constraint categories are surfaced together.
|
||||
|
||||
### TemplateEngine-015 — `RenameCompositionAsync` does not cascade-rename nested derived templates
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:680` |
|
||||
|
||||
**Description**
|
||||
|
||||
`AddCompositionAsync` builds a cascade of derived templates whose names follow a
|
||||
dotted path: composing `$Sensor` (which itself composes `$Probe` as `Probe1`)
|
||||
into `$Pump` as `TempSensor` produces `$Pump.TempSensor` **and** the nested
|
||||
`$Pump.TempSensor.Probe1` (see `CreateCascadedCompositionAsync` and the
|
||||
`AddComposition_CascadesChildCompositions` test). `RenameCompositionAsync`,
|
||||
however, renames only the **directly** slot-owned derived template:
|
||||
|
||||
```csharp
|
||||
var derived = await _repository.GetTemplateByIdAsync(composition.ComposedTemplateId, ...);
|
||||
if (derived != null && derived.IsDerived && derived.OwnerCompositionId == compositionId)
|
||||
{
|
||||
var newDerivedName = $"{owner.Name}.{newInstanceName}";
|
||||
...
|
||||
derived.Name = newDerivedName;
|
||||
await _repository.UpdateTemplateAsync(derived, ...);
|
||||
}
|
||||
```
|
||||
|
||||
There is no recursion into `derived.Compositions`. After renaming the `TempSensor`
|
||||
slot to `MainSensor`, the parent derived becomes `$Pump.MainSensor` but the
|
||||
cascaded child stays `$Pump.TempSensor.Probe1` — its name no longer reflects the
|
||||
slot path it lives under, breaking the dotted-path naming invariant the cascade
|
||||
otherwise maintains. `DeleteCompositionAsync` correctly recurses
|
||||
(`CascadeDeleteDerivedAsync`), so rename is the asymmetric outlier. The
|
||||
`RenameComposition_RenamesSlotAndDerivedTemplate` test only exercises a
|
||||
single-level derived, so the gap is untested. The stale name also breaks the
|
||||
`AddComposition_DerivedNameCollision_Fails` / cascade-name pre-check on any
|
||||
subsequent compose that walks the now-inconsistent name tree.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Recurse over `derived.Compositions` (mirroring `CascadeDeleteDerivedAsync`),
|
||||
re-deriving each cascaded child's name from the renamed parent
|
||||
(`$"{parentDerivedName}.{childComposition.InstanceName}"`), and run the
|
||||
existing same-name collision pre-check across every name the cascade will
|
||||
produce — not just the top-level one. Add a regression test covering a
|
||||
two-level cascade rename.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### TemplateEngine-016 — Composed-script `ScriptScope.ParentPath` is always empty, breaking `Parent.X` resolution for nested modules
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.TemplateEngine/Flattening/FlatteningService.cs:750` |
|
||||
|
||||
**Description**
|
||||
|
||||
`ResolveComposedScriptsRecursive` assigns each composed script a `ScriptScope`:
|
||||
|
||||
```csharp
|
||||
Scope = new Commons.Types.Scripts.ScriptScope(SelfPath: prefix, ParentPath: "")
|
||||
```
|
||||
|
||||
`prefix` is the accumulated path-qualified module path (`Outer` at depth 1,
|
||||
`Outer.Inner` at depth 2, etc.), so `SelfPath` is correct. `ParentPath`, however,
|
||||
is hard-coded to the empty string at every depth. Per `ScriptScope`'s own XML
|
||||
doc, `ParentPath` is "computed at flattening time and seeded into the script's
|
||||
globals … so `Attributes["X"]` / `Parent.X` can prepend the right path-prefix."
|
||||
For a script directly composed at depth 1 the parent is the root and `""` is
|
||||
correct, but for a script in a nested module (`Outer.Inner.Foo`) the parent
|
||||
module is `Outer` — yet `ParentPath` is still `""`. A nested composed script
|
||||
that references `Parent.X` will therefore resolve the reference against the root
|
||||
flat namespace instead of its actual parent module, reading the wrong attribute
|
||||
(or failing to find one). This is the same depth-≥2 nesting blind spot as
|
||||
TemplateEngine-001; the recursive walk was added there but the `Scope`
|
||||
construction was not updated to carry the parent path. `ResolveComposedScripts`
|
||||
for direct (root-template) scripts leaves `Scope` at the default `ScriptScope.Root`,
|
||||
which is correct.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Thread the parent module path through `ResolveComposedScriptsRecursive` (the
|
||||
caller already knows it — it is the `prefix` of the enclosing recursion frame,
|
||||
or `""` for a depth-1 composition) and set
|
||||
`ParentPath` to that value, so `SelfPath = "Outer.Inner"` pairs with
|
||||
`ParentPath = "Outer"`. Add a flattening test asserting the `Scope` of a
|
||||
two-level composed script.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
Reference in New Issue
Block a user