docs(code-reviews): re-review batch 1 at 39d737e — CentralUI, CLI, ClusterInfrastructure, Commons, Communication

17 new findings: CentralUI-020..025, CLI-014..016, ClusterInfrastructure-009..010, Commons-013..014, Communication-012..015.
Joseph Doherty
2026-05-17 00:41:21 -04:00
parent 39d737ebd6
commit e49846603e
6 changed files with 842 additions and 52 deletions


@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.CLI` |
| Design doc | `docs/requirements/Component-CLI.md` |
| Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
+| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Open findings | 3 |
## Summary
@@ -31,8 +31,26 @@ ID-keyed, flag-based surface. Test coverage exercises `OutputFormatter`, `CliCon
`CommandHelpers.HandleResponse`, but the HTTP client, the `debug stream` path, the JSON
argument parsing, and the command-tree wiring are untested.
#### Re-review 2026-05-17 (commit `39d737e`)
All 13 prior findings are confirmed resolved — the resilience gaps, the dead format
configuration, the credential-handling weakness, and the test-coverage holes have all
been closed, and the test suite has grown substantially (`CommandTreeTests`,
`ManagementHttpClientTests`, `DebugStreamTests`, etc.). The CLI's runtime behaviour is now
solid. This re-review walked all 14 command groups against the full checklist and found
three new issues, all rooted in **update-command and design-document drift** rather than
runtime defects: every `update` command requires the entity's "core" fields (`--name`,
`--script`) even though the design doc presents them as optional, so a partial update is
impossible (CLI-014); the design doc's command surface has drifted again in two specific
places — `template composition delete` and the `data-connection` config flags (CLI-015);
and `WriteAsTable` derives table columns from only the first array element, silently
dropping columns for any later element with a different shape (CLI-016). No
Critical/High issues; the module remains healthy.
## Checklist coverage
_Original review (2026-05-16, `9c60592`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Format precedence is broken (CLI-001); empty/non-JSON success bodies crash table rendering (CLI-002, CLI-003). |
@@ -46,6 +64,21 @@ argument parsing, and the command-tree wiring are untested.
| 9 | Testing coverage | ☑ | No tests for `ManagementHttpClient`, `DebugCommands`, command-tree wiring, or JSON argument parsing (CLI-013). |
| 10 | Documentation & comments | ☑ | `Component-CLI.md` mismatch (CLI-007); the in-repo `README.md` is reasonably accurate. Minor exit-code doc mismatch (CLI-009). |
_Re-review (2026-05-17, `39d737e`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Update commands require "core" fields, blocking partial updates (CLI-014); `WriteAsTable` headers derived from first array element only (CLI-016). |
| 2 | Akka.NET conventions | ☑ | Not applicable — pure HTTP/SignalR client. No issues. |
| 3 | Concurrency & thread safety | ☑ | `debug stream` concurrency now resolved via `DebugStreamHelpers` (CLI-011/012). No new issues. |
| 4 | Error handling & resilience | ☑ | Malformed-URL / malformed-JSON / connect-cancellation paths all hardened (CLI-004/005/010). No new issues. |
| 5 | Security | ☑ | Env-var credential fallback in place (CLI-006). Basic Auth over HTTP is by design. No new issues. |
| 6 | Performance & resource management | ☑ | `CancellationTokenSource` now `using`-scoped (CLI-011). No new issues. |
| 7 | Design-document adherence | ☑ | Two residual command-surface drifts: `template composition delete` and `data-connection --primary-config` (CLI-015). |
| 8 | Code organization & conventions | ☑ | Consistent and clean; option construction centralised in `CliOptions`. No new issues. |
| 9 | Testing coverage | ☑ | Substantially expanded (`CommandTreeTests`, `ManagementHttpClientTests`, `DebugStreamTests`). No new gaps. |
| 10 | Documentation & comments | ☑ | XML docs accurate. `Component-CLI.md` drift folded into CLI-015. |
## Findings
### CLI-001 — `SCADALINK_FORMAT` env var and config-file format are dead; format precedence broken
@@ -537,3 +570,121 @@ Resolved 2026-05-16 (commit pending). Coverage gaps confirmed and closed:
`InstanceCommands.TryParseBindings`/`TryParseOverrides` and covered by
`InstanceArgumentParsingTests` under CLI-005.
The CLI test suite went from 42 to 77 passing tests.
### CLI-014 — `update` commands require "core" fields, making partial updates impossible
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/TemplateCommands.cs:77`, `src/ScadaLink.CLI/Commands/SiteCommands.cs:86`, `src/ScadaLink.CLI/Commands/ExternalSystemCommands.cs:40-42`, `src/ScadaLink.CLI/Commands/DataConnectionCommands.cs:39-40`, `src/ScadaLink.CLI/Commands/NotificationCommands.cs:40-41`, `src/ScadaLink.CLI/Commands/ApiMethodCommands.cs:79` |
**Description**
The design doc presents `update` commands with all non-`--id` fields as optional, e.g.
`template update --id <id> [--name <name>] [--description <desc>] [--parent-id <id>]`
(`Component-CLI.md:62`) and `api-method update --id <id> [--script <code>] ...`
(`Component-CLI.md:224`). The implementation contradicts this: the update commands mark
the entity's "core" fields as `Required = true`, so the user must always re-supply them:
- `template update`: `--name` is `Required = true` (`TemplateCommands.cs:77`).
- `site update`: `--name` is `Required = true` (`SiteCommands.cs:86`).
- `external-system update`: `--name`, `--endpoint-url`, `--auth-type` are all
`Required = true` (`ExternalSystemCommands.cs:40-42`).
- `data-connection update`: `--name`, `--protocol` are `Required = true`
(`DataConnectionCommands.cs:39-40`).
- `notification update`: `--name`, `--emails` are `Required = true`
(`NotificationCommands.cs:40-41`).
- `api-method update`: `--script` is `Required = true` (`ApiMethodCommands.cs:79`).
- The same pattern applies to `template attribute/alarm/script update`
(`TemplateCommands.cs:164-165`, `246-248`, `332-334`) and `role-mapping update`
(`SecurityCommands.cs:110-111`).
Because the corresponding `Update*Command` records are whole-replace (they carry the full
field set, not a sparse patch), a user who wants to change only one field — e.g. flip an
API method's timeout, or rename a template — must look up and re-pass every other field's
current value. Omitting any required flag is a hard parse error. This makes scripted,
single-field updates (a core CLI/CI use case) awkward and error-prone, and it does not
match the documented optional-flag surface.
**Recommendation**
Decide on one model and align doc + code. Either (a) make the update flags genuinely
optional and have the server/`Update*Command` treat a null field as "leave unchanged"
(sparse patch), or (b) if whole-replace is intentional, update `Component-CLI.md` to show
these flags as required (no `[...]`) and document that an update replaces the whole
entity. Option (a) matches the documented surface and the typical CLI expectation.
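A minimal sketch of what option (a) could look like, assuming a sparse patch in which a null field means "leave unchanged". The `TemplateState`, `UpdateTemplatePatch`, and `TemplatePatcher` names are hypothetical stand-ins for illustration, not types from the codebase, and the real `Update*Command` records may carry different fields.

```csharp
using System;

// Hypothetical stand-in for the stored entity; not the real ScadaLink type.
public sealed record TemplateState(int Id, string Name, string? Description, int? ParentId);

// Hypothetical sparse patch: every field except Id is optional.
// Null means "the flag was not supplied, leave the stored value alone".
public sealed record UpdateTemplatePatch(
    int Id, string? Name = null, string? Description = null, int? ParentId = null);

public static class TemplatePatcher
{
    // Applies only the supplied fields on top of the current state.
    public static TemplateState Apply(TemplateState current, UpdateTemplatePatch patch)
    {
        if (current.Id != patch.Id)
            throw new ArgumentException("Patch targets a different entity.", nameof(patch));

        return current with
        {
            Name = patch.Name ?? current.Name,
            Description = patch.Description ?? current.Description,
            ParentId = patch.ParentId ?? current.ParentId,
        };
    }
}

public static class Program
{
    public static void Main()
    {
        var current = new TemplateState(7, "Pump", "Base pump template", null);

        // Equivalent of `template update --id 7 --name PumpV2` with no other flags.
        var updated = TemplatePatcher.Apply(current, new UpdateTemplatePatch(7, Name: "PumpV2"));

        Console.WriteLine(updated.Name);        // PumpV2
        Console.WriteLine(updated.Description); // Base pump template
    }
}
```

One limitation of this shape is that null cannot also mean "clear the field"; if clearing a value (for example removing a parent) must be supported, the patch would need an explicit marker for it.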
**Resolution**
_Unresolved._
### CLI-015 — `Component-CLI.md` command surface has drifted again in two places
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `docs/requirements/Component-CLI.md:75`, `docs/requirements/Component-CLI.md:125-126` (vs. `src/ScadaLink.CLI/Commands/TemplateCommands.cs:404-413`, `src/ScadaLink.CLI/Commands/DataConnectionCommands.cs:41`, `:86`) |
**Description**
CLI-007 regenerated the doc's "Command Structure" section, but two specific drifts remain
or were introduced:
- `Component-CLI.md:75` documents `template composition delete --template-id <id>
--instance-name <name>`, but the implementation (`TemplateCommands.cs:404-413`) deletes
a composition by its own integer ID via a single `--id` option
(`DeleteTemplateCompositionCommand(id)`). The doc's two-flag form does not exist.
- `data-connection create` and `update` accept a `--primary-config` option (aliased
`--configuration`) for the primary configuration JSON (`DataConnectionCommands.cs:86`,
`:41`), but `Component-CLI.md:125-126` lists only `--backup-config` and
`--failover-retry-count` — the primary-config flag is absent from the doc.
A reader following the doc would use a non-existent `template composition delete` form and
would not discover the `--primary-config` flag.
**Recommendation**
Correct `Component-CLI.md:75` to `template composition delete --id <id>`, and add
`[--primary-config <json>]` to the documented `data-connection create`/`update` signatures
(`Component-CLI.md:125-126`). Also note the `--configuration` alias if aliases are
documented elsewhere.
**Resolution**
_Unresolved._
### CLI-016 — `WriteAsTable` derives columns from the first array element only
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/CommandHelpers.cs:184-200` |
**Description**
When rendering a JSON array as a table, `WriteAsTable` builds the header set from
`items[0].EnumerateObject()` only (`CommandHelpers.cs:184-186`) and then projects every
row against that fixed header list (`:188-198`). If a later element of the array has a
different shape — additional properties, or properties the first element lacks — those
extra columns are silently dropped from the table and a row missing a header property
renders an empty cell. The user sees a table that appears complete but has omitted data,
with no indication that columns were discarded. (The JSON output path is unaffected; this
only affects `--format table`.) Management API list responses are generally homogeneous,
so the practical impact is low, but a heterogeneous array — e.g. a diff or a mixed-status
list — would be rendered incorrectly with no warning.
**Recommendation**
Compute the header set as the union of property names across all array elements (iterate
all items, collect distinct property names preserving first-seen order) before projecting
rows, so no element's data is silently dropped.
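The union described above can be sketched as follows, using `System.Text.Json`. The `TableHeaders.Union` helper is a hypothetical name for illustration; the actual `WriteAsTable` signature and row projection are not shown in this review and are not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class TableHeaders
{
    // Collects property names across every object in a JSON array,
    // keeping the order in which each name is first seen.
    public static List<string> Union(JsonElement array)
    {
        var headers = new List<string>();
        var seen = new HashSet<string>(StringComparer.Ordinal);

        foreach (var item in array.EnumerateArray())
        {
            if (item.ValueKind != JsonValueKind.Object)
                continue; // skip scalars and nested arrays

            foreach (var property in item.EnumerateObject())
            {
                if (seen.Add(property.Name))
                    headers.Add(property.Name);
            }
        }

        return headers;
    }
}

public static class Program
{
    public static void Main()
    {
        using var doc = JsonDocument.Parse(
            "[{\"id\":1,\"name\":\"a\"},{\"id\":2,\"status\":\"failed\"}]");

        var headers = TableHeaders.Union(doc.RootElement);
        Console.WriteLine(string.Join(",", headers)); // id,name,status
    }
}
```

Rows that lack a given header would still render an empty cell, which is the existing behaviour; the difference is that no property present in the data is left out of the header row.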
**Resolution**
_Unresolved._


@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.CentralUI` |
| Design doc | `docs/requirements/Component-CentralUI.md` |
| Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
+| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Open findings | 6 |
## Summary
@@ -32,6 +32,24 @@ Testing coverage is thin for a module this large: only the script analyzer,
TreeView, schema model, and a few data-connection pages have unit tests; most
pages and the auth bridge are untested.
#### Re-review 2026-05-17 (commit `39d737e`)
All 19 findings from the 2026-05-16 review are confirmed closed. The resolution
batch (`a9bd7ee`..`34588ae`) substantially rewrote the auth bridge, the script
sandbox, several Deployment/Monitoring pages, and the shared component disposal
paths, so this re-review re-examined the post-fix state across all 10 checklist
categories. Six new findings (CentralUI-020 .. 025) were recorded. The most
important is **CentralUI-020**: the two prior fixes interact destructively — the
CentralUI-004 fix made `CookieAuthenticationStateProvider` return a frozen,
constructor-time auth-state snapshot, while the CentralUI-005 fix rewrote
`SessionExpiry.razor` to *poll* that same provider to detect a lapsed session.
Because the snapshot never changes for the life of the circuit, the idle-timeout
redirect can never fire, so the documented idle-logout behaviour is silently
defeated. The remaining new findings are a cross-thread `Dictionary` mutation in
`DebugView`, an unguarded `InvokeAsync` in the new `Deployments` push handler,
and three Low-severity items (residual bare `catch`, magic-string claim
lookups, and the untested `SessionExpiry` polling path).
## Checklist coverage
| # | Category | Examined | Notes | | # | Category | Examined | Notes |
@@ -47,6 +65,14 @@ pages and the auth bridge are untested.
| 9 | Testing coverage | ☑ | Auth, sandbox-run, DebugView, Health, ParkedMessages, most pages untested. |
| 10 | Documentation & comments | ☑ | Comments are accurate and helpful; a few stale claims noted. |
Re-review 2026-05-17 (`39d737e`): all 10 categories re-examined against the
post-fix source. New findings by category: category 3 (CentralUI-020, the
auth-snapshot vs session-poll interaction, which is also a design-adherence
regression; CentralUI-021, cross-thread `Dictionary` mutation), category 4
(CentralUI-022, unguarded `InvokeAsync`; CentralUI-023, residual bare `catch`),
category 8 (CentralUI-024, magic-string claims), category 9 (CentralUI-025,
untested `SessionExpiry` poll). Categories 1, 2, 5, 6, 7, 10 produced no new
findings.
## Findings
### CentralUI-001 — Test Run sandbox executes arbitrary C# with no trust-model enforcement ### CentralUI-001 — Test Run sandbox executes arbitrary C# with no trust-model enforcement
@@ -961,3 +987,232 @@ it was failing on the baseline because `DeploymentService` had gained a
`DiffService` constructor dependency from a DeploymentManager contract change
that the test fixture had not been updated for; `DiffService` is now registered
in the fixture.)
### CentralUI-020 — Idle-session redirect never fires: `SessionExpiry` polls a frozen auth-state snapshot
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Shared/SessionExpiry.razor:39-62`; `src/ScadaLink.CentralUI/Auth/CookieAuthenticationStateProvider.cs:29-43` |
**Description**
The CentralUI-004 fix and the CentralUI-005 fix interact destructively.
CentralUI-004 made `CookieAuthenticationStateProvider` snapshot the principal
**once** in its constructor into a cached `Task<AuthenticationState>` and serve
that exact task for the entire life of the SignalR circuit — it never re-reads
`HttpContext`, never calls `SetAuthenticationState`, and never raises
`NotifyAuthenticationStateChanged`. CentralUI-005 then rewrote
`SessionExpiry.razor` to *poll* `AuthStateProvider.GetAuthenticationStateAsync()`
once a minute and redirect to `/login` "once the sliding cookie has actually
lapsed server-side." But `GetAuthenticationStateAsync()` returns the same frozen
constructor-time snapshot on every call — `auth.User.Identity.IsAuthenticated`
is permanently `true` for the life of the circuit regardless of whether the
server-side cookie has expired. The poll loop therefore never observes an
expired session and the redirect never fires. An idle user whose cookie has
lapsed server-side keeps an authenticated-looking page open indefinitely; the
documented "30-minute idle timeout" is silently defeated for any user who
leaves a circuit open. (The cookie middleware would still reject the *next*
full HTTP request / new circuit, so this is a stale-UI / missed-logout exposure
rather than a full auth bypass — but the page continues to render
authenticated content and a SignalR circuit can stay alive for a long time.)
This is also a design-document-adherence regression against CLAUDE.md
"Security & Auth" (30-minute idle timeout) — recorded under Concurrency because
the root cause is the lifetime/staleness mismatch between the two components.
**Recommendation**
`SessionExpiry` must consult something that actually reflects the live
server-side session, not the circuit's frozen principal. Options: (a) have
`SessionExpiry` poll a lightweight authenticated server endpoint (e.g. a
`/auth/ping` minimal API that returns 401 once the cookie has lapsed) and
redirect on 401; or (b) give `CookieAuthenticationStateProvider` a refresh path
that re-validates the cookie and calls `SetAuthenticationState` /
`NotifyAuthenticationStateChanged` so the polled state can actually change.
Whichever is chosen, add a test that exercises the redirect path with an
expired session (see CentralUI-025).
**Resolution**
_Unresolved._
### CentralUI-021 — `DebugView` stream callback mutates `Dictionary` off the render thread
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Pages/Deployment/DebugView.razor:404-419,511-519,275-289` |
**Description**
The `onEvent` callback passed to `DebugStreamService.StartStreamAsync` runs on
an Akka/gRPC thread (as the design doc and the CentralUI-009 comments state). It
calls `UpsertWithCap(_attributeValues, …)` / `UpsertWithCap(_alarmStates, …)`
**directly on that thread** — the mutation is not marshalled through
`InvokeAsync`; only the subsequent `StateHasChanged` is. Meanwhile the render
thread evaluates `FilteredAttributeValues` / `FilteredAlarmStates`, which
enumerate `_attributeValues.Values` / `_alarmStates.Values` and call
`OrderBy(...).ToList()`. `Dictionary<TKey,TValue>` is not thread-safe: a write
on the Akka thread concurrent with an enumeration on the render thread can throw
`InvalidOperationException` ("Collection was modified; enumeration operation may
not execute") or corrupt the dictionary's internal buckets. The CentralUI-009
fix added a `_disposed` guard but did not address this data race — the guard
only prevents touching a *disposed* component, not concurrent access to a live
one. Under a busy debug stream this will intermittently fault the page.
**Recommendation**
Marshal the dictionary mutation onto the render thread too — move the
`UpsertWithCap` call inside the `SafeInvokeAsync`/`InvokeAsync` body so all
access to `_attributeValues`/`_alarmStates` happens on the renderer's
dispatcher. Alternatively guard both the writes and the `Filtered*` reads with a
lock, or use a concurrent collection. The cap-trim loop must be inside the same
critical section as the upsert.
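A minimal sketch of the lock-based alternative, with the upsert, the cap trim, and the read all sharing one critical section. `CappedStore` is a hypothetical helper and its oldest-inserted-first eviction is an assumption; the real `UpsertWithCap` policy is not shown in this review. Marshalling the mutation through `InvokeAsync` remains the primary suggestion.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; the real DebugView keeps its dictionaries as fields.
public sealed class CappedStore<TValue>
{
    private readonly object _gate = new();
    private readonly Dictionary<string, TValue> _items = new();
    private readonly Queue<string> _insertionOrder = new();
    private readonly int _cap;

    public CappedStore(int cap) => _cap = cap;

    // Called from the stream callback thread.
    public void Upsert(string key, TValue value)
    {
        lock (_gate)
        {
            if (!_items.ContainsKey(key))
                _insertionOrder.Enqueue(key);
            _items[key] = value;

            // Trim inside the same critical section as the write.
            // Assumed policy: evict the oldest-inserted key first.
            while (_items.Count > _cap)
                _items.Remove(_insertionOrder.Dequeue());
        }
    }

    // Called from the render thread; returns a copy that is safe to enumerate.
    public List<TValue> Snapshot()
    {
        lock (_gate)
        {
            return _items.Values.ToList();
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var store = new CappedStore<int>(2);
        store.Upsert("a", 1);
        store.Upsert("b", 2);
        store.Upsert("c", 3); // evicts "a"

        Console.WriteLine(store.Snapshot().Count); // 2
    }
}
```

The `Filtered*` properties would then sort and filter the snapshot copy rather than the live dictionary, so a concurrent write can no longer invalidate an enumeration in progress.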
**Resolution**
_Unresolved._
### CentralUI-022 — `Deployments` push handler fires `InvokeAsync` with no disposal guard
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Pages/Deployment/Deployments.razor:221-229,317-322` |
**Description**
`OnDeploymentStatusChanged` is invoked by `IDeploymentStatusNotifier`, a process
singleton, on the DeploymentManager service thread. The handler does
`_ = InvokeAsync(async () => { await LoadDataAsync(); StateHasChanged(); })`,
discarding the returned task. `Dispose()` unsubscribes the handler, but there is
a race window: the notifier can read the subscriber list and begin invoking
`OnDeploymentStatusChanged` *just before* the component is disposed, so
`InvokeAsync` then runs against a disposed component and throws
`ObjectDisposedException` on the DeploymentManager thread — an unobserved task
exception (the task is fire-and-forget). The same hazard was explicitly fixed
for `DebugView` (CentralUI-009, `SafeInvokeAsync` + `_disposed` flag) and
`ToastNotification` (CentralUI-010), but the new push-based `Deployments`
handler introduced by the CentralUI-006 fix did not adopt the same guard.
Separately, every push event triggers two full repository reloads
(`GetAllInstancesAsync` + `GetAllDeploymentRecordsAsync`) in every open
circuit, so with N open circuits each status change costs 2×N repository
round-trips, and a burst of changes multiplies that further.
**Recommendation**
Add a `volatile bool _disposed` set first in `Dispose()`, have
`OnDeploymentStatusChanged` no-op when set, and wrap the `InvokeAsync` dispatch
to swallow `ObjectDisposedException` (mirror `DebugView.SafeInvokeAsync`).
Optionally coalesce bursts (debounce) and/or reload only the changed record
rather than the whole table on each event.
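A sketch of the guard, assuming a Blazor project reference. `GuardedPushComponent`, `ReloadAsync`, and the parameterless `OnStatusChanged` handler are hypothetical names; the real `IDeploymentStatusNotifier` event signature and the body of `DebugView.SafeInvokeAsync` are not shown in this review and may differ.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Components;

// Hypothetical code-behind base for a page such as Deployments.razor.
public abstract class GuardedPushComponent : ComponentBase, IDisposable
{
    private volatile bool _disposed;

    // Subclasses reload their data here (for example by calling LoadDataAsync).
    protected abstract Task ReloadAsync();

    // Wire this to the notifier event; it may run on a non-render thread.
    protected void OnStatusChanged()
    {
        if (_disposed) return;

        _ = SafeInvokeAsync(async () =>
        {
            if (_disposed) return; // re-check once on the render dispatcher
            await ReloadAsync();
            StateHasChanged();
        });
    }

    private async Task SafeInvokeAsync(Func<Task> work)
    {
        try
        {
            await InvokeAsync(work);
        }
        catch (ObjectDisposedException)
        {
            // Component was torn down between the check and the dispatch.
        }
    }

    public virtual void Dispose()
    {
        // Set the flag first; subclasses then unsubscribe from the notifier.
        _disposed = true;
    }
}
```

Only `ObjectDisposedException` is swallowed here; any other exception from the reload still ends up on the discarded task, so logging inside `SafeInvokeAsync` may be worth adding.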
**Resolution**
_Unresolved._
### CentralUI-023 — Residual bare `catch {}` blocks swallow JS interop errors
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Pages/Monitoring/ParkedMessages.razor:690-698`; `src/ScadaLink.CentralUI/Components/Shared/DiffDialog.razor:107-116,118-130,104` |
**Description**
CentralUI-018 narrowed the bare `catch {}` blocks in `MonacoEditor`,
`TreeView`, and `Sites.razor`, but the same pattern survives elsewhere.
`ParkedMessages.CopyAsync` wraps `navigator.clipboard.writeText` in
`catch { _toast.ShowError("Copy failed."); }` — a real `JSException`
(clipboard permission denied) and an expected `JSDisconnectedException` are
treated identically and neither is logged. `DiffDialog.TryLockBodyAsync` /
`TryUnlockBodyAsync` each have a bare outer `catch` whose handler does another
JS call wrapped in a second bare `catch { /* swallow */ }`, and
`OnAfterRenderAsync`'s `_modalRef.FocusAsync()` is wrapped in a bare
`catch { /* prerender or detached: ignore */ }`. Genuine interop failures in
these paths are invisible in production logs.
**Recommendation**
Catch `JSDisconnectedException` silently and `JSException` (and
`InvalidOperationException` for the prerender focus case) with an `ILogger`
call, consistent with the CentralUI-018 fixes in the same module.
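A sketch of the narrowed handling for the clipboard case, assuming `Microsoft.JSInterop` and `Microsoft.Extensions.Logging` are available. `ClipboardInterop.TryCopyAsync` is a hypothetical helper name; the real `ParkedMessages.CopyAsync` would keep its toast call and use the boolean result to decide whether to show it.

```csharp
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
using Microsoft.JSInterop;

public static class ClipboardInterop
{
    // Returns true when the copy succeeded, false otherwise.
    public static async Task<bool> TryCopyAsync(IJSRuntime js, ILogger logger, string text)
    {
        try
        {
            await js.InvokeVoidAsync("navigator.clipboard.writeText", text);
            return true;
        }
        catch (JSDisconnectedException)
        {
            // The circuit is gone; nothing useful to log or show.
            return false;
        }
        catch (JSException ex)
        {
            // Genuine interop failure, for example clipboard permission denied.
            logger.LogWarning(ex, "Clipboard copy failed.");
            return false;
        }
    }
}
```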
**Resolution**
_Unresolved._
### CentralUI-024 — Claim lookups use magic strings instead of `JwtTokenService` constants
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Layout/NavMenu.razor:102`; `src/ScadaLink.CentralUI/Components/Pages/Dashboard.razor:14`; `GetCurrentUserAsync` in `Templates.razor`, `TemplateEdit.razor`, `TemplateCreate.razor`, `SharedScripts.razor`, `SharedScriptForm.razor`, `Sites.razor`, `Topology.razor`, `InstanceCreate.razor`, `InstanceConfigure.razor` |
**Description**
`ScadaLink.Security.JwtTokenService` exposes the canonical claim-type constants
(`UsernameClaimType = "Username"`, `DisplayNameClaimType = "DisplayName"`,
`RoleClaimType`, `SiteIdClaimType`). `SiteScopeService` correctly uses
`JwtTokenService.SiteIdClaimType`, but every `GetCurrentUserAsync` helper across
ten pages does `authState.User.FindFirst("Username")?.Value`, and `NavMenu` /
`Dashboard` do `context.User.FindFirst("DisplayName")`. The literals happen to
match the constants today, so there is no live bug — but if the claim type is
ever renamed in `JwtTokenService` (the single source of truth) every one of
these call sites silently breaks, falling back to `"unknown"` for the audit
user and a blank display name. The duplicated `GetCurrentUserAsync` helper is
also copy-pasted verbatim into ten components.
**Recommendation**
Replace the string literals with `JwtTokenService.UsernameClaimType` /
`DisplayNameClaimType`. Consider extracting the repeated `GetCurrentUserAsync`
into a single shared helper (e.g. an extension on `AuthenticationStateProvider`
or a small scoped service) so the claim lookup lives in exactly one place.
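One possible shape for the shared helper, as an extension on `AuthenticationStateProvider`. `ClaimTypesSketch` is a hypothetical stand-in so the sketch compiles on its own; real code would reference `JwtTokenService.UsernameClaimType` instead. The `"unknown"` fallback mirrors the behaviour described above.

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Components.Authorization;

// Stand-in for ScadaLink.Security.JwtTokenService so the sketch compiles alone;
// real code would reference the existing constants instead.
public static class ClaimTypesSketch
{
    public const string UsernameClaimType = "Username";
}

public static class AuthenticationStateProviderExtensions
{
    // Single place for the claim lookup that is currently duplicated per page.
    public static async Task<string> GetCurrentUsernameAsync(
        this AuthenticationStateProvider provider)
    {
        var state = await provider.GetAuthenticationStateAsync();
        return state.User.FindFirst(ClaimTypesSketch.UsernameClaimType)?.Value ?? "unknown";
    }
}
```

Each page's `GetCurrentUserAsync` would then reduce to a call such as `await AuthStateProvider.GetCurrentUsernameAsync()`, assuming the injected provider is named `AuthStateProvider` as in `SessionExpiry.razor`.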
**Resolution**
_Unresolved._
### CentralUI-025 — `SessionExpiry` polling/redirect path has no test coverage
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.CentralUI.Tests/Auth/SessionExpiryPolicyTests.cs`; `src/ScadaLink.CentralUI/Components/Shared/SessionExpiry.razor` |
**Description**
`SessionExpiryPolicyTests` covers only `AuthEndpoints.BuildSignInProperties()`
(the sign-in properties shape). The actual runtime behaviour of
`SessionExpiry.razor` — that an expired session triggers a redirect to
`/login`, that an authenticated session does not, and that the component does
not poll/redirect on the `/login` page itself — is untested. Had a behavioural
test exercised the redirect with an expired/anonymous auth state against the
real `CookieAuthenticationStateProvider`, the CentralUI-020 defect (the frozen
snapshot never reporting an expired session) would have been caught. The
component is the system's only client-side idle-logout mechanism, so the gap is
material.
**Recommendation**
Add bUnit tests for `SessionExpiry`: (a) with an unauthenticated auth state the
component navigates to `/login`; (b) with an authenticated state it does not;
(c) on the `/login` route it neither polls nor redirects. The provider used in
the test must be one whose state can actually transition to expired — which
also forces the CentralUI-020 fix.
**Resolution**
_Unresolved._


@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.ClusterInfrastructure` |
| Design doc | `docs/requirements/Component-ClusterInfrastructure.md` |
| Status | Reviewed |
-| Last reviewed | 2026-05-16 |
+| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
-| Commit reviewed | `9c60592` |
+| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Open findings | 2 |
## Summary
@@ -29,20 +29,39 @@ every other component runs on, yet it presently delivers nothing the design requ
The single options class is clean and its test covers defaults and setters
adequately for what exists.
**Re-review 2026-05-17 (commit `39d737e`).** All eight prior findings (CI-001..008)
were resolved by the batch of work between `9c60592` and `39d737e`: `ClusterOptions`
gained XML docs, the `SectionName` constant, and the `DownIfAlone` property;
`ClusterOptionsValidator` was added; `ServiceCollectionExtensions` now registers the
validator and throws from the dead actor-registration method; and the test project
grew to 16 cases across three test classes. The module is in good shape — the
`ClusterOptions` contract, its validator, and the DI registration are all sound,
well-documented, and well-tested. This re-review examined all three source files and
all three test files against the full 10-category checklist and found **two new
issues**, both stemming from work the prior review explicitly deferred to a "Host
review" that has not happened: the `DownIfAlone` property is exposed and validated as
part of the configuration contract but is never consumed — `ScadaLink.Host`'s
`BuildHocon` still hard-codes `down-if-alone = on` (CI-009, Medium) — and the validator
does not enforce the design doc's requirement that `down-if-alone` be `on` for the
keep-oldest resolver, so `DownIfAlone = false` is silently accepted (CI-010, Low).
## Checklist coverage
Original review (2026-05-16, `9c60592`) below; the re-review notes (2026-05-17,
`39d737e`) are appended in each row.
| # | Category | Examined | Notes | | # | Category | Examined | Notes |
|---|----------|----------|-------| |---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | No executable logic exists beyond an options POCO; no logic bugs, but `ServiceCollectionExtensions` returns success while doing nothing (CI-002). | | 1 | Correctness & logic bugs | ✓ | No executable logic exists beyond an options POCO; no logic bugs, but `ServiceCollectionExtensions` returns success while doing nothing (CI-002). **Re-review:** CI-002 resolved. New — `DownIfAlone` is a settable property that controls nothing because the HOCON builder hard-codes the value (CI-009). |
| 2 | Akka.NET conventions | ✓ | No actors, no `ActorSystem` bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against `Tell`/`Ask`, immutability, or `PipeTo`. | | 2 | Akka.NET conventions | ✓ | No actors, no `ActorSystem` bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against `Tell`/`Ask`, immutability, or `PipeTo`. **Re-review:** confirmed the Akka bootstrap legitimately lives in `ScadaLink.Host` (CI-001 resolution); still nothing actor-related in this module. No issues. |
| 3 | Concurrency & thread safety | ✓ | No shared mutable state, no actors, no async code. No issues found in current code. | | 3 | Concurrency & thread safety | ✓ | No shared mutable state, no actors, no async code. No issues found in current code. **Re-review:** validator and DI extensions are stateless; no issues. |
| 4 | Error handling & resilience | ✓ | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. | | 4 | Error handling & resilience | ✓ | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. **Re-review:** the validator now fails fast on misconfiguration. New — it does not enforce the design doc's `down-if-alone = on` requirement (CI-010). |
| 5 | Security | ✓ | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. | | 5 | Security | ✓ | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. **Re-review:** still no authn/authz surface, no secret handling. No issues. |
| 6 | Performance & resource management | ✓ | No streams, connections, timers, or `IDisposable` resources exist yet. No issues found in current code. | | 6 | Performance & resource management | ✓ | No streams, connections, timers, or `IDisposable` resources exist yet. No issues found in current code. **Re-review:** no resources held; the validator allocates a small failure list per call only. No issues. |
| 7 | Design-document adherence | ✓ | Severe drift: the module implements none of its documented responsibilities (CI-001). `ClusterOptions` also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and `down-if-alone` (CI-003). | | 7 | Design-document adherence | ✓ | Severe drift: the module implements none of its documented responsibilities (CI-001). `ClusterOptions` also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and `down-if-alone` (CI-003). **Re-review:** CI-001/CI-003 resolved (ownership split documented; `DownIfAlone` added). New — `DownIfAlone` was added to the contract but never wired into the HOCON (CI-009). |
| 8 | Code organization & conventions | ✓ | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing `IValidateOptions`/data-annotation validation (CI-004) versus the Options pattern intent. | | 8 | Code organization & conventions | ✓ | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing `IValidateOptions`/data-annotation validation (CI-004) versus the Options pattern intent. **Re-review:** CI-004/CI-005 resolved; `SectionName` constant present and options/validator placement correct. No issues. |
| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). | | 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). **Re-review:** CI-006 resolved — 16 tests across three classes covering options, validator, and DI registration. No `DownIfAlone`-wiring test exists, but that wiring lives in the Host (CI-009). No new issue here. |
| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). | | 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). **Re-review:** CI-007/CI-008 resolved — full XML docs on all members; skeleton comments gone. Note: the `DownIfAlone` XML doc calls `true` "the design-doc requirement" yet the value is inert (CI-009) and unenforced (CI-010). |
## Findings
@@ -490,3 +509,97 @@ section, and the component table reflects the true placement. This is a
documentation-only finding, so no runtime regression test is meaningful; verified by
inspection of `ServiceCollectionExtensions.cs` and
`docs/requirements/Component-ClusterInfrastructure.md:21-39`.
### ClusterInfrastructure-009 — `DownIfAlone` is an inert configuration knob — never consumed by the HOCON builder
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:74` |
**Description**
The `DownIfAlone` property was added to `ClusterOptions` by CI-003's resolution as
part of "the split-brain configuration contract". It is public, defaults to `true`,
carries an XML doc presenting it as "the design-doc requirement", and is exercised by
`ClusterOptionsTests.DownIfAlone_CanBeSet`. However, nothing in the system reads it.
The Akka.NET HOCON is generated by `ScadaLink.Host.Actors.AkkaHostedService.BuildHocon`,
which **hard-codes** the resolver setting:
```
split-brain-resolver {
active-strategy = ...
stable-after = ...
keep-oldest {
down-if-alone = on
}
}
```
`BuildHocon` receives the full `ClusterOptions` instance and consumes every other
field (`SeedNodes`, `MinNrOfMembers`, `SplitBrainResolverStrategy`, `StableAfter`,
`HeartbeatInterval`, `FailureDetectionThreshold`) but ignores `DownIfAlone` entirely.
The result is a configuration property that an operator can set in `appsettings.json`,
that passes validation, and that has **zero runtime effect** — setting
`DownIfAlone: false` does not turn the flag off. CI-003's resolution explicitly
acknowledged this gap ("wiring it to read `DownIfAlone` is a one-line `ScadaLink.Host`
change ... noted for the Host's review") but the wiring was never done and no tracked
finding carried it, so the gap has silently persisted to commit `39d737e`. An inert,
misleadingly-documented configuration knob is a correctness and design-adherence
defect: it gives operators a false sense of control over a safety-critical resolver
setting.
**Recommendation**
Either (a) wire `DownIfAlone` into `BuildHocon` — emit `down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}`
— so the property does what its XML doc claims (a Host-side change, to be tracked in
the Host module's review since `BuildHocon` lives there), or (b) if the flag is
intentionally fixed at `on` and must never be operator-configurable, remove the
`DownIfAlone` property from `ClusterOptions` and document the hard-coded `on` value as
a non-negotiable invariant. Do not leave a public, settable, validated property that
controls nothing.
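A minimal sketch of option (a), under stated assumptions: the real `BuildHocon` is not reproduced in this review, so `ResolverSettings`, `SplitBrainHoconSketch`, and `BuildResolverFragment` are hypothetical stand-ins, and the seconds-based `stable-after` formatting is an assumption. The only point illustrated is that `down-if-alone` is interpolated from the option instead of being hard-coded.

```csharp
using System;

// Hypothetical stand-in for the slice of ClusterOptions relevant to the resolver block.
public sealed record ResolverSettings(string Strategy, TimeSpan StableAfter, bool DownIfAlone);

public static class SplitBrainHoconSketch
{
    // Emits the split-brain-resolver block with down-if-alone driven by the option
    // rather than fixed at "on".
    public static string BuildResolverFragment(ResolverSettings s)
    {
        var downIfAlone = s.DownIfAlone ? "on" : "off";
        return $@"split-brain-resolver {{
  active-strategy = {s.Strategy}
  stable-after = {(int)s.StableAfter.TotalSeconds}s
  keep-oldest {{
    down-if-alone = {downIfAlone}
  }}
}}";
    }
}
```

If option (b) is chosen instead, no such interpolation is needed and the property is simply removed.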
**Resolution**
_Unresolved._
### ClusterInfrastructure-010 — Validator does not enforce `DownIfAlone = true` despite the design doc requiring it
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptionsValidator.cs:21-71` |
**Description**
`Component-ClusterInfrastructure.md` (Split-Brain Resolution) states the keep-oldest
resolver must be configured with `down-if-alone = on`, and the XML doc on
`ClusterOptions.DownIfAlone` calls `true` "the design-doc requirement" — the rationale
being that without it the oldest node can run as an isolated single-node cluster
during a partition while the younger node forms its own. `ClusterOptionsValidator`
guards every other safety-critical setting (`MinNrOfMembers == 1`, `keep-oldest`-only
strategy, positive timings, heartbeat below the failure threshold) but performs no
check on `DownIfAlone`. A configuration of `DownIfAlone: false` therefore passes
validation cleanly. This is currently latent because CI-009 shows the property is not
consumed at all, but the moment CI-009 is fixed by wiring the property into the HOCON
(option (a)), `DownIfAlone: false` would silently produce the unsafe single-node
behaviour the design doc explicitly forbids — with no fail-fast guard. The validator
is the right place to enforce the invariant, consistent with how it already rejects
quorum split-brain strategies.
**Recommendation**
If CI-009 is resolved by keeping `DownIfAlone` configurable, add a check to
`ClusterOptionsValidator.Validate` that fails when `DownIfAlone` is `false` (or, if
some future deployment legitimately needs it off, fails only in combination with the
`keep-oldest` strategy), with a message explaining the isolated-single-node-cluster
hazard. If CI-009 is resolved by removing the property, this finding is moot and
should be closed as resolved alongside it.
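The recommended guard could look roughly like the following. This is a sketch only: `DownIfAloneCheckSketch` and its parameters are hypothetical, and the real `ClusterOptionsValidator.Validate` signature is not shown in this review. It follows the failure-list style the checklist notes attribute to the validator.

```csharp
using System.Collections.Generic;

public static class DownIfAloneCheckSketch
{
    // Returns validation failures; an empty list means the settings are acceptable.
    public static IReadOnlyList<string> Validate(string strategy, bool downIfAlone)
    {
        var failures = new List<string>();
        if (strategy == "keep-oldest" && !downIfAlone)
        {
            failures.Add(
                "DownIfAlone must be true with the keep-oldest strategy: otherwise the oldest " +
                "node can keep running as an isolated single-node cluster during a partition.");
        }
        return failures;
    }
}
```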
**Resolution**
_Unresolved._

@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.Commons` | | Module | `src/ScadaLink.Commons` |
| Design doc | `docs/requirements/Component-Commons.md` | | Design doc | `docs/requirements/Component-Commons.md` |
| Status | Reviewed | | Status | Reviewed |
| Last reviewed | 2026-05-16 | | Last reviewed | 2026-05-17 |
| Reviewer | claude-agent | | Reviewer | claude-agent |
| Commit reviewed | `9c60592` | | Commit reviewed | `39d737e` |
| Open findings | 0 | | Open findings | 2 |
## Summary
@@ -32,14 +32,28 @@ kind of edge-case logic that warrants them. Entity and message contracts otherwi
clean and additive-evolution-friendly, with the exception of one `ValueTuple` use in a
wire command.
**Re-review 2026-05-17 (commit `39d737e`).** All twelve prior findings (Commons-001
through Commons-012) are confirmed `Resolved` — the fixes are sound, well-targeted, and
backed by focused regression tests (`StaleTagMonitorRaceTests`, `DynamicJsonElementTests`,
`ScriptParametersTests`, `ManagementCommandRegistryTests`, `OpcUaEndpointConfigSerializerTests`,
`ResultTests`, `ValueFormatterTests`, `ConnectionBindingSerializationTests`,
`FlatteningAndScriptScopeTests`). The new files introduced since `9c60592`
(`TemplateAlarm` lock/inherit fields, `IExternalSystemRepository` name-keyed lookups,
`DeploymentStateQueryRequest`/`Response`, `ParameterDefinition`) follow the established
POCO / record / additive-evolution conventions and carry round-trip compatibility tests.
Two new Low-severity findings were recorded this pass: a `DynamicJsonElement` array
indexer that rejects `long` indices (Commons-013) and an `OpcUaEndpointConfigSerializer`
legacy-fallback path that can mislabel a corrupt new-shape row as `Legacy` (Commons-014).
No Critical, High, or Medium issues were found.
## Checklist coverage
| # | Category | Examined | Notes | | # | Category | Examined | Notes |
|---|----------|----------|-------| |---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | `DynamicJsonElement.TryConvert` returns success for non-convertible types; `Result<T>` allows null error; legacy-config fallback loses data. | | 1 | Correctness & logic bugs | ✓ | `DynamicJsonElement.TryConvert` returns success for non-convertible types; `Result<T>` allows null error; legacy-config fallback loses data. Re-review: `DynamicJsonElement.TryGetIndex` rejects non-`int` indices (Commons-013). |
| 2 | Akka.NET conventions | ✓ | Commons has no actors (correct). Message contracts are records and immutable. One wire message uses `ValueTuple` (Commons-008). Correlation IDs present on request/response messages. | | 2 | Akka.NET conventions | ✓ | Commons has no actors (correct). Message contracts are records and immutable. One wire message uses `ValueTuple` (Commons-008). Correlation IDs present on request/response messages. |
| 3 | Concurrency & thread safety | ✓ | `StaleTagMonitor` has a check-then-act race between the timer callback and `OnValueReceived` (Commons-001). | | 3 | Concurrency & thread safety | ✓ | `StaleTagMonitor` has a check-then-act race between the timer callback and `OnValueReceived` (Commons-001). |
| 4 | Error handling & resilience | ✓ | `ScriptParameters.GetNullable` silently swallows conversion failures (Commons-003); OPC UA legacy deserialize discards malformed input (Commons-005). | | 4 | Error handling & resilience | ✓ | `ScriptParameters.GetNullable` silently swallows conversion failures (Commons-003); OPC UA legacy deserialize discards malformed input (Commons-005). Re-review: corrupt typed OPC UA rows can fall through to the legacy path and be mislabelled `Legacy` (Commons-014). |
| 5 | Security | ✓ | No auth logic here. `SmtpConfiguration.Credentials` / OPC UA passwords are plain-string fields (storage/encryption is a consumer concern) — noted, not a finding. No script-trust violations: Commons defines no forbidden-API surface. | | 5 | Security | ✓ | No auth logic here. `SmtpConfiguration.Credentials` / OPC UA passwords are plain-string fields (storage/encryption is a consumer concern) — noted, not a finding. No script-trust violations: Commons defines no forbidden-API surface. |
| 6 | Performance & resource management | ✓ | `StaleTagMonitor` disposes its `Timer` correctly. `DynamicJsonElement` references a `JsonElement` whose backing document lifetime is not owned (Commons-002). | | 6 | Performance & resource management | ✓ | `StaleTagMonitor` disposes its `Timer` correctly. `DynamicJsonElement` references a `JsonElement` whose backing document lifetime is not owned (Commons-002). |
| 7 | Design-document adherence | ✓ | Several behavior-bearing helper/validator/serializer classes push against REQ-COM-6 "no business logic" (Commons-007). Folder layout matches REQ-COM-5b. | | 7 | Design-document adherence | ✓ | Several behavior-bearing helper/validator/serializer classes push against REQ-COM-6 "no business logic" (Commons-007). Folder layout matches REQ-COM-5b. |
@@ -566,3 +580,75 @@ the parameterless `ToString()`). The XML doc gained a remarks block stating the
culture-invariant contract and why. Regression tests added in `ValueFormatterTests`
(`FormatDisplayValue_Double_UsesInvariantCulture_*`, `_DateTime_*`, `_CollectionOfDoubles_*`,
each pinned under `de-DE`).
### Commons-013 — `DynamicJsonElement.TryGetIndex` rejects non-`int` index values
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Commons/Types/DynamicJsonElement.cs:40-54` |
**Description**
`TryGetIndex` accepts an index only when `indexes[0] is int index`. `DynamicJsonElement`
is designed for dynamic access from scripts (`obj.items[0]`). In a `dynamic` expression the
index operand's runtime type follows the script's variable type: a script that computes
an index in a `long`-typed loop counter, or reads it from another `DynamicJsonElement` (whose
numbers are unwrapped as `long` by `Wrap`, see `:105`), will pass a `long`, not an `int`. The
pattern match then fails, `TryGetIndex` returns `false`, and the dynamic binder throws a
`RuntimeBinderException` for what is a perfectly valid in-range index. Because the wrapper
itself surfaces JSON numbers as `long`, `obj.items[obj.count - 1]` (with `count` a wrapped
JSON number) is the exact failing case. The `int`-only guard likewise rejects
`byte`/`short` indices that would widen to a valid array position.
**Recommendation**
Accept any integral index by converting through `Convert.ToInt64` (guarded for
`OverflowException`) or by matching `int`, `long`, `short`, `byte` and normalizing to a
single integer before the bounds check. Add a regression test indexing with a `long`.
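One way to normalize the index, sketched with a hypothetical helper (`IndexNormalizerSketch.TryNormalizeIndex` is not part of the codebase under review). It takes the pattern-matching route rather than `Convert.ToInt64`, so non-integral inputs such as strings or doubles are still rejected.

```csharp
using System;

public static class IndexNormalizerSketch
{
    // Returns true and an in-range int position when the boxed value is one of the
    // accepted integral types; returns false for other types or out-of-range values.
    public static bool TryNormalizeIndex(object raw, int length, out int index)
    {
        index = -1;
        long value;
        switch (raw)
        {
            case int i: value = i; break;
            case long l: value = l; break;
            case short s: value = s; break;
            case byte b: value = b; break;
            default: return false;
        }
        if (value < 0 || value >= length) return false;
        index = (int)value; // safe: value is within [0, length) and length is an int
        return true;
    }
}
```

`TryGetIndex` would call this with `indexes[0]` and the array length, and return `false` when it fails, which preserves today's behaviour for genuinely invalid indices.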
**Resolution**
_Unresolved._
### Commons-014 — `OpcUaEndpointConfigSerializer.Deserialize` can mislabel a corrupt typed row as `Legacy`
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Commons/Serialization/OpcUaEndpointConfigSerializer.cs:107-131` |
**Description**
`Deserialize` tries the typed path first: it parses the document, checks for an
`endpointUrl` property, then calls `JsonSerializer.Deserialize<OpcUaEndpointConfig>`.
The whole block is wrapped in `catch (JsonException) { /* fall through to legacy */ }`.
If a row *is* the current typed shape (it has `endpointUrl`) but is corrupt in a way that
makes `JsonSerializer.Deserialize` throw a `JsonException` — e.g. an enum-valued field
holding an unrecognised string, or a numeric field holding a non-numeric token — the
exception is swallowed and control falls through to `LoadLegacy`. `LoadLegacy` only
requires the root to be a JSON object, so it will usually succeed against the same input
and the result is reported as `OpcUaConfigParseStatus.Legacy`. The Commons-005 fix added
the `Malformed` status precisely so a caller can tell a recoverable legacy row from
unparseable data; this path re-introduces a softer version of the same confusion — a
genuinely broken current-shape row is presented to the user as a benign "please re-save"
legacy row, and the offending field is silently dropped by `FromFlatDict` (which ignores
keys it cannot parse) rather than surfaced. The XML doc describes the legacy fallback as
being for "pre-refactor rows" only and does not mention this branch.
**Recommendation**
Only fall through to `LoadLegacy` when the typed shape is genuinely *not present* — i.e.
the `endpointUrl` property is absent. When `endpointUrl` *is* present but typed
deserialization throws, classify the outcome as `Malformed` (or a distinct status) so the
caller can surface a real error instead of an empty/partial config. Tighten the XML doc
to describe this branch, and add a regression test for a typed row with an invalid enum
field.
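The recommended control flow can be sketched as follows. Everything here is illustrative: `ParseStatusSketch`, `EndpointConfigSketch`, and `ClassifySketch` are hypothetical and much smaller than the real `OpcUaEndpointConfig` / `OpcUaConfigParseStatus` types, and a non-numeric token in a numeric field stands in for the "corrupt typed row" case.

```csharp
using System.Text.Json;

public enum ParseStatusSketch { Typed, Legacy, Malformed }

// Hypothetical, trimmed-down typed shape; the real config has more fields.
public sealed class EndpointConfigSketch
{
    public string EndpointUrl { get; set; } = "";
    public int SessionTimeoutMs { get; set; }
}

public static class ClassifySketch
{
    private static readonly JsonSerializerOptions Options =
        new JsonSerializerOptions { PropertyNameCaseInsensitive = true };

    public static ParseStatusSketch Classify(string json)
    {
        JsonDocument doc;
        try { doc = JsonDocument.Parse(json); }
        catch (JsonException) { return ParseStatusSketch.Malformed; }

        using (doc)
        {
            var root = doc.RootElement;
            if (root.ValueKind != JsonValueKind.Object) return ParseStatusSketch.Malformed;

            // Legacy fallback only when the typed marker property is genuinely absent.
            if (!root.TryGetProperty("endpointUrl", out _)) return ParseStatusSketch.Legacy;

            try
            {
                JsonSerializer.Deserialize<EndpointConfigSketch>(json, Options);
                return ParseStatusSketch.Typed;
            }
            catch (JsonException)
            {
                // Typed shape is present but corrupt: report it, do not fall through.
                return ParseStatusSketch.Malformed;
            }
        }
    }
}
```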
**Resolution**
_Unresolved._

@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.Communication` | | Module | `src/ScadaLink.Communication` |
| Design doc | `docs/requirements/Component-Communication.md` | | Design doc | `docs/requirements/Component-Communication.md` |
| Status | Reviewed | | Status | Reviewed |
| Last reviewed | 2026-05-16 | | Last reviewed | 2026-05-17 |
| Reviewer | claude-agent | | Reviewer | claude-agent |
| Commit reviewed | `9c60592` | | Commit reviewed | `39d737e` |
| Open findings | 0 | | Open findings | 4 |
## Summary
@@ -25,20 +25,37 @@ CLAUDE.md "Resume for coordinator actors" decision. Design-doc adherence is othe
good. Test coverage is broad for happy paths but has gaps around failover, cache
mutation races, and the snapshot-timeout cleanup path.
#### Re-review 2026-05-17 (commit `39d737e`)
All prior findings (Communication-001..011) are confirmed `Resolved` in this commit
and the fixes hold up against the source. The re-review walked all 10 checklist
categories again and uncovered a previously-missed defect at the centre of the gRPC
node-failover path: **`SiteStreamGrpcClientFactory.GetOrCreate` caches one client per
site identifier and silently ignores the `grpcEndpoint` argument on a cache hit**. The
`DebugStreamBridgeActor` reconnect logic flips `_useNodeA` and passes the *other*
node's endpoint, but the factory hands back the original NodeA-bound client every
time — so the documented "try the other site node endpoint" failover never actually
moves to NodeB (Communication-012). The same caching defect means a site address
change is never picked up because `RemoveSiteAsync` has no production caller
(Communication-013). Two Low findings round out the re-review: an untrusted
gRPC-supplied `correlation_id` flows straight into an Akka actor name
(Communication-014), and the factory's endpoint-reuse defect is masked by the test
mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low.
## Checklist coverage
| # | Category | Examined | Notes | | # | Category | Examined | Notes |
|---|----------|----------|-------| |---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Snapshot-timeout orphan, reconnect not calling `CleanupGrpc`, subscription-map races. | | 1 | Correctness & logic bugs | ✓ | Re-review: factory ignores endpoint on cache hit, defeating NodeA→NodeB stream failover (Communication-012). Prior items resolved. |
| 2 | Akka.NET conventions | ✓ | No supervision strategy on coordinators; `Sender` captured in async-launched closure path. | | 2 | Akka.NET conventions | ✓ | Coordinator `Resume` strategies now present and verified. No new issues. |
| 3 | Concurrency & thread safety | ✓ | `SiteStreamGrpcClient._subscriptions` overwrite/remove race; `_siteClients` field reassignment unused but non-readonly. | | 3 | Concurrency & thread safety | ✓ | Subscription-map register/remove now ownership-checked. `_siteClients` readonly. No new issues. |
| 4 | Error handling & resilience | ✓ | gRPC reconnect leaks server-side relay; `LoadSiteAddressesFromDb` swallows DB failures silently. | | 4 | Error handling & resilience | ✓ | `Status.Failure` handler added; reconnect unsubscribes prior stream. No new issues. |
| 5 | Security | ✓ | No findings in module code. DebugStreamHub auth lives outside this module (Central UI). | | 5 | Security | ✓ | Re-review: public gRPC `correlation_id` flows unvalidated into an Akka actor name (Communication-014). |
| 6 | Performance & resource management | ✓ | Orphaned subscriptions/CTS leaks; `SiteStreamGrpcClientFactory.Dispose` blocks on async. | | 6 | Performance & resource management | ✓ | Synchronous `Dispose` paths fixed; CTS leaks resolved. No new issues. |
| 7 | Design-document adherence | ✓ | `GrpcMaxStreamLifetime` / keepalive options defined but never applied; hard-coded values used instead. | | 7 | Design-document adherence | ✓ | Re-review: site gRPC address-change disposal not wired — `RemoveSiteAsync` is dead code (Communication-013). gRPC options now applied. |
| 8 | Code organization & conventions | ✓ | Options pattern correct; minor: public records declared in actor files. No structural issues. | | 8 | Code organization & conventions | ✓ | Options pattern correct; public records still declared in actor files (acceptable). No structural issues. |
| 9 | Testing coverage | ✓ | No tests for snapshot-timeout cleanup, address-cache refresh races, or gRPC server reconnect-leak. | | 9 | Testing coverage | ✓ | Re-review: prior gaps closed, but the factory mock masks the endpoint-reuse defect — no real node-flip coverage (Communication-015). |
| 10 | Documentation & comments | ✓ | XML comment on `DebugStreamBridgeActor` says "Persistent actor" — it is not an Akka.Persistence actor. | | 10 | Documentation & comments | ✓ | `DebugStreamBridgeActor` summary corrected. No new issues. |
## Findings
@@ -519,3 +536,151 @@ passes after):
(added with this finding's resolution).
The full module suite (`dotnet test tests/ScadaLink.Communication.Tests`) is green at
111 passing tests.
### Communication-012 — gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:39`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:166` |
**Description**
`SiteStreamGrpcClientFactory.GetOrCreate` is `_clients.GetOrAdd(siteIdentifier, _ =>
CreateClient(grpcEndpoint))` — it keys the cache by **site identifier only** and the
`grpcEndpoint` argument is used *exclusively* for the first-ever creation. Every
subsequent call for that site returns the originally-cached `SiteStreamGrpcClient`,
which is permanently bound to the `GrpcChannel` of whatever endpoint was passed first.
`DebugStreamBridgeActor` relies on the opposite behaviour. On a gRPC stream error,
`HandleGrpcError` flips `_useNodeA` and `OpenGrpcStream` recomputes
`endpoint = _useNodeA ? _grpcNodeAAddress : _grpcNodeBAddress`, then calls
`_grpcFactory.GetOrCreate(_siteIdentifier, endpoint)` expecting a client connected to
the *other* node. Because the factory ignores the new endpoint, the bridge actor
reconnects to the **same failed NodeA endpoint** on every retry. The design doc's
core debug-stream failover behaviour ("tries the other site node endpoint", "NodeB if
NodeA failed, or vice versa") is therefore inoperative — when a site node goes down,
the debug stream cannot move to the surviving node and simply exhausts `MaxRetries`
against the dead endpoint and terminates. The `_useNodeA` flip, the `previousEndpoint`
computation in `HandleGrpcError`, and the `CleanupGrpc` endpoint selection are all
dead logic. (Communication-002's `Unsubscribe`-before-reconnect fix still functions,
but it unsubscribes and re-subscribes on the *same* client/node rather than the
intended other node.)
**Recommendation**
Make the per-site client aware of both endpoints, or key the cache by
`(siteIdentifier, endpoint)`, or have `GetOrCreate` detect an endpoint change and
dispose+recreate the cached client. Given the design intent ("Falls back to NodeB if
NodeA connection fails"), the cleanest fix is to give `SiteStreamGrpcClient` (or a
per-site holder) both NodeA/NodeB addresses and let it switch channels internally,
removing the endpoint argument from `GetOrCreate` entirely. Add a test that drives a
real `SiteStreamGrpcClientFactory` through a node flip and asserts the second client
targets the other endpoint.
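As an illustration of the "detect an endpoint change and dispose+recreate" variant (not the per-site holder the recommendation prefers), a sketch could look like this. `FakeStreamClient` and `EndpointAwareFactorySketch` are hypothetical stand-ins; the real `SiteStreamGrpcClient` wraps a `GrpcChannel`, which is not modelled here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for SiteStreamGrpcClient: remembers the endpoint it was bound to.
public sealed class FakeStreamClient : IDisposable
{
    public FakeStreamClient(string endpoint) { Endpoint = endpoint; }
    public string Endpoint { get; }
    public bool Disposed { get; private set; }
    public void Dispose() { Disposed = true; }
}

public sealed class EndpointAwareFactorySketch
{
    private readonly object _gate = new object();
    private readonly Dictionary<string, FakeStreamClient> _clients =
        new Dictionary<string, FakeStreamClient>();

    // Reuses the cached client only when it is bound to the requested endpoint;
    // otherwise replaces it and disposes the stale one outside the lock.
    public FakeStreamClient GetOrCreate(string siteIdentifier, string grpcEndpoint)
    {
        FakeStreamClient stale = null;
        FakeStreamClient current;
        lock (_gate)
        {
            if (_clients.TryGetValue(siteIdentifier, out var existing))
            {
                if (existing.Endpoint == grpcEndpoint) return existing;
                stale = existing;
            }
            current = new FakeStreamClient(grpcEndpoint);
            _clients[siteIdentifier] = current;
        }
        stale?.Dispose();
        return current;
    }
}
```

A trade-off of this variant is that replacing the client disposes one that other callers for the same site may still hold, which is part of why a per-site holder that knows both addresses may be the cleaner design.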
**Resolution**
_Unresolved._
### Communication-013 — Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:58` |
**Description**
The design doc states that `SiteStreamGrpcClientFactory` "Disposes clients on site
removal or address change." `RemoveSiteAsync` implements the disposal mechanism, but
a repo-wide search finds **no production caller** — only tests invoke it. Combined
with the cache-by-site-identifier behaviour (Communication-012), the consequence is
that once a site's `SiteStreamGrpcClient` is created, a later edit to that site's
`GrpcNodeAAddress` / `GrpcNodeBAddress` (via the Central UI or CLI) is never reflected
in the cached client — it keeps using the stale channel for the life of the process.
`CentralCommunicationActor` already refreshes the *Akka* address cache every 60s and
recreates ClusterClients on change, but there is no equivalent invalidation path
wired into the gRPC client factory. A site whose gRPC endpoints are corrected after
an initial misconfiguration will never have working debug streaming until the central
node is restarted.
**Recommendation**
Wire a site-removal / address-change signal into `SiteStreamGrpcClientFactory`:
for example, have `CentralCommunicationActor` (which already detects address changes in
`HandleSiteAddressCacheLoaded`) call `RemoveSiteAsync` for sites whose gRPC addresses
changed or were removed, or fold the gRPC endpoints into the same refresh cycle. If
the on-the-fly address-change requirement is intentionally dropped, remove
`RemoveSiteAsync` and correct the design doc.
**Resolution**
_Unresolved._
### Communication-014 — Untrusted gRPC `correlation_id` flows directly into an Akka actor name
| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:124` |
**Description**
`SubscribeInstance` is a public gRPC endpoint hosted on each site node. It creates the
relay actor with `$"stream-relay-{request.CorrelationId}-{actorSeq}"` as the actor
name, where `request.CorrelationId` comes straight off the wire. Akka actor names have
a restricted character set; a `correlation_id` containing `/`, whitespace, or other
disallowed characters makes `ActorSystem.ActorOf` throw `InvalidActorNameException`.
That exception is not caught inside `SubscribeInstance`, so it escapes as an unhandled
RPC fault. Because the throw happens before the `try`, the `finally` cleanup does not run
for any `_streamSubscriber.Subscribe` / `_activeStreams` state already set up at that
point. In practice central always supplies a GUID, so impact is
low, but the server is trusting client-supplied input to be actor-name-safe.
**Recommendation**
Validate `request.CorrelationId` on entry (non-empty, matches an expected GUID/safe
pattern) and reject with `StatusCode.InvalidArgument` otherwise; or derive the actor
name solely from the internal `_actorCounter` and keep the correlation ID only as
actor state / dictionary key.
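A sketch of the first option, assuming the correlation ID is always expected to be a GUID (which the description says holds in practice). `CorrelationIdSketch.TryGetSafeActorName` is a hypothetical helper; the caller would map a `false` result to `StatusCode.InvalidArgument`.

```csharp
using System;

public static class CorrelationIdSketch
{
    // Accepts only GUID-shaped correlation IDs and rebuilds the actor name from the
    // parsed value, so no raw wire characters ever reach the actor name.
    public static bool TryGetSafeActorName(string correlationId, long actorSeq, out string actorName)
    {
        actorName = string.Empty;
        if (!Guid.TryParse(correlationId, out var parsed)) return false;
        actorName = $"stream-relay-{parsed:N}-{actorSeq}";
        return true;
    }
}
```

Formatting the parsed GUID with `N` yields 32 hexadecimal digits, so the resulting name contains only letters, digits, and hyphens.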
**Resolution**
_Unresolved._
### Communication-015 — No test exercises the real gRPC client factory across a node flip
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.Communication.Tests/Grpc/DebugStreamBridgeActorTests.cs:401`, `tests/ScadaLink.Communication.Tests/Grpc/SiteStreamGrpcClientFactoryTests.cs` |
**Description**
`DebugStreamBridgeActorTests` exercises the reconnect/failover paths through
`MockSiteStreamGrpcClientFactory`, which returns one fixed mock client regardless of
the `grpcEndpoint` argument. This is exactly the behaviour the *real*
`SiteStreamGrpcClientFactory` exhibits incorrectly (Communication-012), so the mock
masks the defect: `On_GrpcError_Reconnects_To_Other_Node` passes even though the real
factory never reaches the other node. `SiteStreamGrpcClientFactoryTests` only asserts
`GetOrCreate` returns the same client for the same site — it never checks what happens
when the same site is requested with a *different* endpoint.
**Recommendation**
Add a `SiteStreamGrpcClientFactoryTests` case that calls `GetOrCreate(site, endpointA)`
then `GetOrCreate(site, endpointB)` and asserts the second call targets `endpointB`
(it should fail today and pass after Communication-012 is fixed). Have the bridge-actor
test's mock factory track the endpoint per call so node-flip coverage is meaningful.
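For the bridge-actor side, a recording fake along these lines would make node-flip assertions possible. This is a sketch: `RecordingFactorySketch` is hypothetical and returns a plain string token where the real mock would return a mock client.

```csharp
using System.Collections.Generic;

public sealed class RecordingFactorySketch
{
    // Every (site, endpoint) pair requested, in call order.
    public List<(string Site, string Endpoint)> Calls { get; } =
        new List<(string Site, string Endpoint)>();

    // Records the call and returns a token identifying the endpoint that was asked for,
    // so a test can tell which node a "client" belongs to.
    public string GetOrCreate(string siteIdentifier, string grpcEndpoint)
    {
        Calls.Add((siteIdentifier, grpcEndpoint));
        return "client@" + grpcEndpoint;
    }
}
```

A reconnect test could then assert that the endpoint recorded for the second call differs from the first, instead of only asserting that a reconnect happened.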
**Resolution**
_Unresolved._

@@ -40,20 +40,20 @@ module file and counted in **Total**.
| Severity | Open findings | | Severity | Open findings |
|----------|---------------| |----------|---------------|
| Critical | 0 | | Critical | 0 |
| High | 0 | | High | 2 |
| Medium | 0 | | Medium | 5 |
| Low | 0 | | Low | 10 |
| **Total** | **0** | | **Total** | **17** |
## Module Status
| Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total | | Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total |
|--------|---------------|--------|----------------|------|-------| |--------|---------------|--------|----------------|------|-------|
| [CLI](CLI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 | | [CLI](CLI/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/2 | 3 | 16 |
| [CentralUI](CentralUI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 19 | | [CentralUI](CentralUI/findings.md) | 2026-05-16 | `9c60592` | 0/1/2/3 | 6 | 25 |
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 8 | | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/1 | 2 | 10 |
| [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 12 | | [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/2 | 2 | 14 |
| [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 | | [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/2 | 4 | 15 |
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 | | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 | | [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 | | [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
@@ -80,14 +80,34 @@ description, location, recommendation — lives in the module's `findings.md`.
_None open._
### High (0) ### High (2)
_None open._ | ID | Module | Title |
|----|--------|-------|
| CentralUI-020 | [CentralUI](CentralUI/findings.md) | Idle-session redirect never fires: `SessionExpiry` polls a frozen auth-state snapshot |
| Communication-012 | [Communication](Communication/findings.md) | gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover |
### Medium (0) ### Medium (5)
_None open._ | ID | Module | Title |
|----|--------|-------|
| CLI-014 | [CLI](CLI/findings.md) | `update` commands require "core" fields, making partial updates impossible |
| CentralUI-021 | [CentralUI](CentralUI/findings.md) | `DebugView` stream callback mutates `Dictionary` off the render thread |
| CentralUI-022 | [CentralUI](CentralUI/findings.md) | `Deployments` push handler fires `InvokeAsync` with no disposal guard |
| ClusterInfrastructure-009 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | `DownIfAlone` is an inert configuration knob — never consumed by the HOCON builder |
| Communication-013 | [Communication](Communication/findings.md) | Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller |
### Low (0) ### Low (10)
_None open._ | ID | Module | Title |
|----|--------|-------|
| CLI-015 | [CLI](CLI/findings.md) | `Component-CLI.md` command surface has drifted again in two places |
| CLI-016 | [CLI](CLI/findings.md) | `WriteAsTable` derives columns from the first array element only |
| CentralUI-023 | [CentralUI](CentralUI/findings.md) | Residual bare `catch {}` blocks swallow JS interop errors |
| CentralUI-024 | [CentralUI](CentralUI/findings.md) | Claim lookups use magic strings instead of `JwtTokenService` constants |
| CentralUI-025 | [CentralUI](CentralUI/findings.md) | `SessionExpiry` polling/redirect path has no test coverage |
| ClusterInfrastructure-010 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Validator does not enforce `DownIfAlone = true` despite the design doc requiring it |
| Commons-013 | [Commons](Commons/findings.md) | `DynamicJsonElement.TryGetIndex` rejects non-`int` index values |
| Commons-014 | [Commons](Commons/findings.md) | `OpcUaEndpointConfigSerializer.Deserialize` can mislabel a corrupt typed row as `Legacy` |
| Communication-014 | [Communication](Communication/findings.md) | Untrusted gRPC `correlation_id` flows directly into an Akka actor name |
| Communication-015 | [Communication](Communication/findings.md) | No test exercises the real gRPC client factory across a node flip |