docs(code-reviews): re-review batch 1 at 39d737e — CentralUI, CLI, ClusterInfrastructure, Commons, Communication
17 new findings: CentralUI-020..025, CLI-014..016, ClusterInfrastructure-009..010, Commons-013..014, Communication-012..015.
This commit is contained in:
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.CLI` |
|
||||
| Design doc | `docs/requirements/Component-CLI.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 3 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -31,8 +31,26 @@ ID-keyed, flag-based surface. Test coverage exercises `OutputFormatter`, `CliCon
|
||||
`CommandHelpers.HandleResponse`, but the HTTP client, the `debug stream` path, the JSON
|
||||
argument parsing, and the command-tree wiring are untested.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
All 13 prior findings are confirmed resolved — the resilience gaps, the dead format
|
||||
configuration, the credential-handling weakness, and the test-coverage holes have all
|
||||
been closed, and the test suite has grown substantially (`CommandTreeTests`,
|
||||
`ManagementHttpClientTests`, `DebugStreamTests`, etc.). The CLI's runtime behaviour is now
|
||||
solid. This re-review walked all 14 command groups against the full checklist and found
|
||||
three new issues, all rooted in **update-command and design-document drift** rather than
|
||||
runtime defects: every `update` command requires the entity's "core" fields (`--name`,
|
||||
`--script`) even though the design doc presents them as optional, so a partial update is
|
||||
impossible (CLI-014); the design doc's command surface has drifted again in two specific
|
||||
places — `template composition delete` and the `data-connection` config flags (CLI-015);
|
||||
and `WriteAsTable` derives table columns from only the first array element, silently
|
||||
dropping columns for any later element with a different shape (CLI-016). No
|
||||
Critical/High issues; the module remains healthy.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
_Original review (2026-05-16, `9c60592`):_
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ☑ | Format precedence is broken (CLI-001); empty/non-JSON success bodies crash table rendering (CLI-002, CLI-003). |
|
||||
@@ -46,6 +64,21 @@ argument parsing, and the command-tree wiring are untested.
|
||||
| 9 | Testing coverage | ☑ | No tests for `ManagementHttpClient`, `DebugCommands`, command-tree wiring, or JSON argument parsing (CLI-013). |
|
||||
| 10 | Documentation & comments | ☑ | `Component-CLI.md` mismatch (CLI-007); the in-repo `README.md` is reasonably accurate. Minor exit-code doc mismatch (CLI-009). |
|
||||
|
||||
_Re-review (2026-05-17, `39d737e`):_
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ☑ | Update commands require "core" fields, blocking partial updates (CLI-014); `WriteAsTable` headers derived from first array element only (CLI-016). |
|
||||
| 2 | Akka.NET conventions | ☑ | Not applicable — pure HTTP/SignalR client. No issues. |
|
||||
| 3 | Concurrency & thread safety | ☑ | `debug stream` concurrency now resolved via `DebugStreamHelpers` (CLI-011/012). No new issues. |
|
||||
| 4 | Error handling & resilience | ☑ | Malformed-URL / malformed-JSON / connect-cancellation paths all hardened (CLI-004/005/010). No new issues. |
|
||||
| 5 | Security | ☑ | Env-var credential fallback in place (CLI-006). Basic Auth over HTTP is by design. No new issues. |
|
||||
| 6 | Performance & resource management | ☑ | `CancellationTokenSource` now `using`-scoped (CLI-011). No new issues. |
|
||||
| 7 | Design-document adherence | ☑ | Two residual command-surface drifts: `template composition delete` and `data-connection --primary-config` (CLI-015). |
|
||||
| 8 | Code organization & conventions | ☑ | Consistent and clean; option construction centralised in `CliOptions`. No new issues. |
|
||||
| 9 | Testing coverage | ☑ | Substantially expanded (`CommandTreeTests`, `ManagementHttpClientTests`, `DebugStreamTests`). No new gaps. |
|
||||
| 10 | Documentation & comments | ☑ | XML docs accurate. `Component-CLI.md` drift folded into CLI-015. |
|
||||
|
||||
## Findings
|
||||
|
||||
### CLI-001 — `SCADALINK_FORMAT` env var and config-file format are dead; format precedence broken
|
||||
@@ -537,3 +570,121 @@ Resolved 2026-05-16 (commit pending). Coverage gaps confirmed and closed:
|
||||
`InstanceCommands.TryParseBindings`/`TryParseOverrides` and covered by
|
||||
`InstanceArgumentParsingTests` under CLI-005.
|
||||
The CLI test suite went from 42 to 77 passing tests.
|
||||
|
||||
### CLI-014 — `update` commands require "core" fields, making partial updates impossible
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CLI/Commands/TemplateCommands.cs:77`, `src/ScadaLink.CLI/Commands/SiteCommands.cs:86`, `src/ScadaLink.CLI/Commands/ExternalSystemCommands.cs:40-42`, `src/ScadaLink.CLI/Commands/DataConnectionCommands.cs:39-40`, `src/ScadaLink.CLI/Commands/NotificationCommands.cs:40-41`, `src/ScadaLink.CLI/Commands/ApiMethodCommands.cs:79` |
|
||||
|
||||
**Description**
|
||||
|
||||
The design doc presents `update` commands with all non-`--id` fields as optional, e.g.
|
||||
`template update --id <id> [--name <name>] [--description <desc>] [--parent-id <id>]`
|
||||
(`Component-CLI.md:62`) and `api-method update --id <id> [--script <code>] ...`
|
||||
(`Component-CLI.md:224`). The implementation contradicts this: the update commands mark
|
||||
the entity's "core" fields as `Required = true`, so the user must always re-supply them:
|
||||
|
||||
- `template update` — `--name` is `Required = true` (`TemplateCommands.cs:77`).
|
||||
- `site update` — `--name` is `Required = true` (`SiteCommands.cs:86`).
|
||||
- `external-system update` — `--name`, `--endpoint-url`, `--auth-type` are all
|
||||
`Required = true` (`ExternalSystemCommands.cs:40-42`).
|
||||
- `data-connection update` — `--name`, `--protocol` are `Required = true`
|
||||
(`DataConnectionCommands.cs:39-40`).
|
||||
- `notification update` — `--name`, `--emails` are `Required = true`
|
||||
(`NotificationCommands.cs:40-41`).
|
||||
- `api-method update` — `--script` is `Required = true` (`ApiMethodCommands.cs:79`).
|
||||
- The same pattern applies to `template attribute/alarm/script update`
|
||||
(`TemplateCommands.cs:164-165`, `246-248`, `332-334`) and `role-mapping update`
|
||||
(`SecurityCommands.cs:110-111`).
|
||||
|
||||
Because the corresponding `Update*Command` records are whole-replace (they carry the full
|
||||
field set, not a sparse patch), a user who wants to change only one field — e.g. flip an
|
||||
API method's timeout, or rename a template — must look up and re-pass every other field's
|
||||
current value. Omitting any required flag is a hard parse error. This makes scripted,
|
||||
single-field updates (a core CLI/CI use case) awkward and error-prone, and it does not
|
||||
match the documented optional-flag surface.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Decide on one model and align doc + code. Either (a) make the update flags genuinely
|
||||
optional and have the server/`Update*Command` treat a null field as "leave unchanged"
|
||||
(sparse patch), or (b) if whole-replace is intentional, update `Component-CLI.md` to show
|
||||
these flags as required (no `[...]`) and document that an update replaces the whole
|
||||
entity. Option (a) matches the documented surface and the typical CLI expectation.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CLI-015 — `Component-CLI.md` command surface has drifted again in two places
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `docs/requirements/Component-CLI.md:75`, `docs/requirements/Component-CLI.md:125-126` (vs. `src/ScadaLink.CLI/Commands/TemplateCommands.cs:404-413`, `src/ScadaLink.CLI/Commands/DataConnectionCommands.cs:41`, `:86`) |
|
||||
|
||||
**Description**
|
||||
|
||||
CLI-007 regenerated the doc's "Command Structure" section, but two specific drifts remain
|
||||
or were introduced:
|
||||
|
||||
- `Component-CLI.md:75` documents `template composition delete --template-id <id>
|
||||
--instance-name <name>`, but the implementation (`TemplateCommands.cs:404-413`) deletes
|
||||
a composition by its own integer ID via a single `--id` option
|
||||
(`DeleteTemplateCompositionCommand(id)`). The doc's two-flag form does not exist.
|
||||
- `data-connection create` and `update` accept a `--primary-config` option (aliased
|
||||
`--configuration`) for the primary configuration JSON (`DataConnectionCommands.cs:86`,
|
||||
`:41`), but `Component-CLI.md:125-126` lists only `--backup-config` and
|
||||
`--failover-retry-count` — the primary-config flag is absent from the doc.
|
||||
|
||||
A reader following the doc would use a non-existent `template composition delete` form and
|
||||
would not discover the `--primary-config` flag.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Correct `Component-CLI.md:75` to `template composition delete --id <id>`, and add
|
||||
`[--primary-config <json>]` to the documented `data-connection create`/`update` signatures
|
||||
(`Component-CLI.md:125-126`). Also note the `--configuration` alias if aliases are
|
||||
documented elsewhere.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CLI-016 — `WriteAsTable` derives columns from the first array element only
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CLI/Commands/CommandHelpers.cs:184-200` |
|
||||
|
||||
**Description**
|
||||
|
||||
When rendering a JSON array as a table, `WriteAsTable` builds the header set from
|
||||
`items[0].EnumerateObject()` only (`CommandHelpers.cs:184-186`) and then projects every
|
||||
row against that fixed header list (`:188-198`). If a later element of the array has a
|
||||
different shape — additional properties, or properties the first element lacks — those
|
||||
extra columns are silently dropped from the table and a row missing a header property
|
||||
renders an empty cell. The user sees a table that appears complete but has omitted data,
|
||||
with no indication that columns were discarded. (The JSON output path is unaffected; this
|
||||
only affects `--format table`.) Management API list responses are generally homogeneous,
|
||||
so the practical impact is low, but a heterogeneous array — e.g. a diff or a mixed-status
|
||||
list — would be rendered incorrectly with no warning.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Compute the header set as the union of property names across all array elements (iterate
|
||||
all items, collect distinct property names preserving first-seen order) before projecting
|
||||
rows, so no element's data is silently dropped.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.CentralUI` |
|
||||
| Design doc | `docs/requirements/Component-CentralUI.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 6 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -32,6 +32,24 @@ Testing coverage is thin for a module this large: only the script analyzer,
|
||||
TreeView, schema model, and a few data-connection pages have unit tests; most
|
||||
pages and the auth bridge are untested.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
All 19 findings from the 2026-05-16 review are confirmed closed. The resolution
|
||||
batch (`a9bd7ee`..`34588ae`) substantially rewrote the auth bridge, the script
|
||||
sandbox, several Deployment/Monitoring pages, and the shared component disposal
|
||||
paths, so this re-review re-examined the post-fix state across all 10 checklist
|
||||
categories. Six new findings (CentralUI-020 .. 025) were recorded. The most
|
||||
important is **CentralUI-020**: the two prior fixes interact destructively — the
|
||||
CentralUI-004 fix made `CookieAuthenticationStateProvider` return a frozen,
|
||||
constructor-time auth-state snapshot, while the CentralUI-005 fix rewrote
|
||||
`SessionExpiry.razor` to *poll* that same provider to detect a lapsed session.
|
||||
Because the snapshot never changes for the life of the circuit, the idle-timeout
|
||||
redirect can never fire, so the documented idle-logout behaviour is silently
|
||||
defeated. The remaining new findings are a cross-thread `Dictionary` mutation in
|
||||
`DebugView`, an unguarded `InvokeAsync` in the new `Deployments` push handler,
|
||||
and three Low-severity items (residual bare `catch`, magic-string claim
|
||||
lookups, and the untested `SessionExpiry` polling path).
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
@@ -47,6 +65,14 @@ pages and the auth bridge are untested.
|
||||
| 9 | Testing coverage | ☑ | Auth, sandbox-run, DebugView, Health, ParkedMessages, most pages untested. |
|
||||
| 10 | Documentation & comments | ☑ | Comments are accurate and helpful; a few stale claims noted. |
|
||||
|
||||
Re-review 2026-05-17 (`39d737e`): all 10 categories re-examined against the
|
||||
post-fix source. New findings — category 3 (CentralUI-020 the auth-snapshot vs
|
||||
session-poll interaction is also a design-adherence regression; CentralUI-021
|
||||
cross-thread `Dictionary`; CentralUI-022 unguarded `InvokeAsync`), category 4
|
||||
(CentralUI-023 residual bare `catch`), category 8 (CentralUI-024 magic-string
|
||||
claims), category 9 (CentralUI-025 untested `SessionExpiry` poll). Categories
|
||||
1, 2, 5, 6, 7, 10 produced no new findings.
|
||||
|
||||
## Findings
|
||||
|
||||
### CentralUI-001 — Test Run sandbox executes arbitrary C# with no trust-model enforcement
|
||||
@@ -961,3 +987,232 @@ it was failing on the baseline because `DeploymentService` had gained a
|
||||
`DiffService` constructor dependency from a DeploymentManager contract change
|
||||
that the test fixture had not been updated for; `DiffService` is now registered
|
||||
in the fixture.)
|
||||
|
||||
### CentralUI-020 — Idle-session redirect never fires: `SessionExpiry` polls a frozen auth-state snapshot
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CentralUI/Components/Shared/SessionExpiry.razor:39-62`; `src/ScadaLink.CentralUI/Auth/CookieAuthenticationStateProvider.cs:29-43` |
|
||||
|
||||
**Description**
|
||||
|
||||
The CentralUI-004 fix and the CentralUI-005 fix interact destructively.
|
||||
CentralUI-004 made `CookieAuthenticationStateProvider` snapshot the principal
|
||||
**once** in its constructor into a cached `Task<AuthenticationState>` and serve
|
||||
that exact task for the entire life of the SignalR circuit — it never re-reads
|
||||
`HttpContext`, never calls `SetAuthenticationState`, and never raises
|
||||
`NotifyAuthenticationStateChanged`. CentralUI-005 then rewrote
|
||||
`SessionExpiry.razor` to *poll* `AuthStateProvider.GetAuthenticationStateAsync()`
|
||||
once a minute and redirect to `/login` "once the sliding cookie has actually
|
||||
lapsed server-side." But `GetAuthenticationStateAsync()` returns the same frozen
|
||||
constructor-time snapshot on every call — `auth.User.Identity.IsAuthenticated`
|
||||
is permanently `true` for the life of the circuit regardless of whether the
|
||||
server-side cookie has expired. The poll loop therefore never observes an
|
||||
expired session and the redirect never fires. An idle user whose cookie has
|
||||
lapsed server-side keeps an authenticated-looking page open indefinitely; the
|
||||
documented "30-minute idle timeout" is silently defeated for any user who
|
||||
leaves a circuit open. (The cookie middleware would still reject the *next*
|
||||
full HTTP request / new circuit, so this is a stale-UI / missed-logout exposure
|
||||
rather than a full auth bypass — but the page continues to render
|
||||
authenticated content and a SignalR circuit can stay alive for a long time.)
|
||||
This is also a design-document-adherence regression against CLAUDE.md
|
||||
"Security & Auth" (30-minute idle timeout) — recorded under Concurrency because
|
||||
the root cause is the lifetime/staleness mismatch between the two components.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
`SessionExpiry` must consult something that actually reflects the live
|
||||
server-side session, not the circuit's frozen principal. Options: (a) have
|
||||
`SessionExpiry` poll a lightweight authenticated server endpoint (e.g. a
|
||||
`/auth/ping` minimal API that returns 401 once the cookie has lapsed) and
|
||||
redirect on 401; or (b) give `CookieAuthenticationStateProvider` a refresh path
|
||||
that re-validates the cookie and calls `SetAuthenticationState` /
|
||||
`NotifyAuthenticationStateChanged` so the polled state can actually change.
|
||||
Whichever is chosen, add a test that exercises the redirect path with an
|
||||
expired session (see CentralUI-025).
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CentralUI-021 — `DebugView` stream callback mutates `Dictionary` off the render thread
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CentralUI/Components/Pages/Deployment/DebugView.razor:404-419,511-519,275-289` |
|
||||
|
||||
**Description**
|
||||
|
||||
The `onEvent` callback passed to `DebugStreamService.StartStreamAsync` runs on
|
||||
an Akka/gRPC thread (as the design doc and the CentralUI-009 comments state). It
|
||||
calls `UpsertWithCap(_attributeValues, …)` / `UpsertWithCap(_alarmStates, …)`
|
||||
**directly on that thread** — the mutation is not marshalled through
|
||||
`InvokeAsync`; only the subsequent `StateHasChanged` is. Meanwhile the render
|
||||
thread evaluates `FilteredAttributeValues` / `FilteredAlarmStates`, which
|
||||
enumerate `_attributeValues.Values` / `_alarmStates.Values` and call
|
||||
`OrderBy(...).ToList()`. `Dictionary<TKey,TValue>` is not thread-safe: a write
|
||||
on the Akka thread concurrent with an enumeration on the render thread can throw
|
||||
`InvalidOperationException` ("Collection was modified; enumeration operation may
|
||||
not execute") or corrupt the dictionary's internal buckets. The CentralUI-009
|
||||
fix added a `_disposed` guard but did not address this data race — the guard
|
||||
only prevents touching a *disposed* component, not concurrent access to a live
|
||||
one. Under a busy debug stream this will intermittently fault the page.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Marshal the dictionary mutation onto the render thread too — move the
|
||||
`UpsertWithCap` call inside the `SafeInvokeAsync`/`InvokeAsync` body so all
|
||||
access to `_attributeValues`/`_alarmStates` happens on the renderer's
|
||||
dispatcher. Alternatively guard both the writes and the `Filtered*` reads with a
|
||||
lock, or use a concurrent collection. The cap-trim loop must be inside the same
|
||||
critical section as the upsert.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CentralUI-022 — `Deployments` push handler fires `InvokeAsync` with no disposal guard
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CentralUI/Components/Pages/Deployment/Deployments.razor:221-229,317-322` |
|
||||
|
||||
**Description**
|
||||
|
||||
`OnDeploymentStatusChanged` is invoked by `IDeploymentStatusNotifier`, a process
|
||||
singleton, on the DeploymentManager service thread. The handler does
|
||||
`_ = InvokeAsync(async () => { await LoadDataAsync(); StateHasChanged(); })`,
|
||||
discarding the returned task. `Dispose()` unsubscribes the handler, but there is
|
||||
a race window: the notifier can read the subscriber list and begin invoking
|
||||
`OnDeploymentStatusChanged` *just before* the component is disposed, so
|
||||
`InvokeAsync` then runs against a disposed component and throws
|
||||
`ObjectDisposedException` on the DeploymentManager thread — an unobserved task
|
||||
exception (the task is fire-and-forget). The same hazard was explicitly fixed
|
||||
for `DebugView` (CentralUI-009, `SafeInvokeAsync` + `_disposed` flag) and
|
||||
`ToastNotification` (CentralUI-010), but the new push-based `Deployments`
|
||||
handler introduced by the CentralUI-006 fix did not adopt the same guard.
|
||||
Separately, every push event triggers two full repository reloads
|
||||
(`GetAllInstancesAsync` + `GetAllDeploymentRecordsAsync`) for every open
|
||||
circuit, so a burst of status changes amplifies into N×2 round-trips per tick.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add a `volatile bool _disposed` set first in `Dispose()`, have
|
||||
`OnDeploymentStatusChanged` no-op when set, and wrap the `InvokeAsync` dispatch
|
||||
to swallow `ObjectDisposedException` (mirror `DebugView.SafeInvokeAsync`).
|
||||
Optionally coalesce bursts (debounce) and/or reload only the changed record
|
||||
rather than the whole table on each event.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CentralUI-023 — Residual bare `catch {}` blocks swallow JS interop errors
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CentralUI/Components/Pages/Monitoring/ParkedMessages.razor:690-698`; `src/ScadaLink.CentralUI/Components/Shared/DiffDialog.razor:107-116,118-130,104` |
|
||||
|
||||
**Description**
|
||||
|
||||
CentralUI-018 narrowed the bare `catch {}` blocks in `MonacoEditor`,
|
||||
`TreeView`, and `Sites.razor`, but the same pattern survives elsewhere.
|
||||
`ParkedMessages.CopyAsync` wraps `navigator.clipboard.writeText` in
|
||||
`catch { _toast.ShowError("Copy failed."); }` — a real `JSException`
|
||||
(clipboard permission denied) and an expected `JSDisconnectedException` are
|
||||
treated identically and neither is logged. `DiffDialog.TryLockBodyAsync` /
|
||||
`TryUnlockBodyAsync` each have a bare outer `catch` whose handler does another
|
||||
JS call wrapped in a second bare `catch { /* swallow */ }`, and
|
||||
`OnAfterRenderAsync`'s `_modalRef.FocusAsync()` is wrapped in a bare
|
||||
`catch { /* prerender or detached: ignore */ }`. Genuine interop failures in
|
||||
these paths are invisible in production logs.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Catch `JSDisconnectedException` silently and `JSException` (and
|
||||
`InvalidOperationException` for the prerender focus case) with an `ILogger`
|
||||
call, consistent with the CentralUI-018 fixes in the same module.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CentralUI-024 — Claim lookups use magic strings instead of `JwtTokenService` constants
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Code organization & conventions |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.CentralUI/Components/Layout/NavMenu.razor:102`; `src/ScadaLink.CentralUI/Components/Pages/Dashboard.razor:14`; `GetCurrentUserAsync` in `Templates.razor`, `TemplateEdit.razor`, `TemplateCreate.razor`, `SharedScripts.razor`, `SharedScriptForm.razor`, `Sites.razor`, `Topology.razor`, `InstanceCreate.razor`, `InstanceConfigure.razor` |
|
||||
|
||||
**Description**
|
||||
|
||||
`ScadaLink.Security.JwtTokenService` exposes the canonical claim-type constants
|
||||
(`UsernameClaimType = "Username"`, `DisplayNameClaimType = "DisplayName"`,
|
||||
`RoleClaimType`, `SiteIdClaimType`). `SiteScopeService` correctly uses
|
||||
`JwtTokenService.SiteIdClaimType`, but every `GetCurrentUserAsync` helper across
|
||||
ten pages does `authState.User.FindFirst("Username")?.Value`, and `NavMenu` /
|
||||
`Dashboard` do `context.User.FindFirst("DisplayName")`. The literals happen to
|
||||
match the constants today, so there is no live bug — but if the claim type is
|
||||
ever renamed in `JwtTokenService` (the single source of truth) every one of
|
||||
these call sites silently breaks, falling back to `"unknown"` for the audit
|
||||
user and a blank display name. The duplicated `GetCurrentUserAsync` helper is
|
||||
also copy-pasted verbatim into ten components.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Replace the string literals with `JwtTokenService.UsernameClaimType` /
|
||||
`DisplayNameClaimType`. Consider extracting the repeated `GetCurrentUserAsync`
|
||||
into a single shared helper (e.g. an extension on `AuthenticationStateProvider`
|
||||
or a small scoped service) so the claim lookup lives in exactly one place.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### CentralUI-025 — `SessionExpiry` polling/redirect path has no test coverage
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Testing coverage |
|
||||
| Status | Open |
|
||||
| Location | `tests/ScadaLink.CentralUI.Tests/Auth/SessionExpiryPolicyTests.cs`; `src/ScadaLink.CentralUI/Components/Shared/SessionExpiry.razor` |
|
||||
|
||||
**Description**
|
||||
|
||||
`SessionExpiryPolicyTests` covers only `AuthEndpoints.BuildSignInProperties()`
|
||||
(the sign-in properties shape). The actual runtime behaviour of
|
||||
`SessionExpiry.razor` — that an expired session triggers a redirect to
|
||||
`/login`, that an authenticated session does not, and that the component does
|
||||
not poll/redirect on the `/login` page itself — is untested. Had a behavioural
|
||||
test exercised the redirect with an expired/anonymous auth state against the
|
||||
real `CookieAuthenticationStateProvider`, the CentralUI-020 defect (the frozen
|
||||
snapshot never reporting an expired session) would have been caught. The
|
||||
component is the system's only client-side idle-logout mechanism, so the gap is
|
||||
material.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add bUnit tests for `SessionExpiry`: (a) with an unauthenticated auth state the
|
||||
component navigates to `/login`; (b) with an authenticated state it does not;
|
||||
(c) on the `/login` route it neither polls nor redirects. The provider used in
|
||||
the test must be one whose state can actually transition to expired — which
|
||||
also forces the CentralUI-020 fix.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.ClusterInfrastructure` |
|
||||
| Design doc | `docs/requirements/Component-ClusterInfrastructure.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 2 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -29,20 +29,39 @@ every other component runs on, yet it presently delivers nothing the design requ
|
||||
The single options class is clean and its test covers defaults and setters
|
||||
adequately for what exists.
|
||||
|
||||
**Re-review 2026-05-17 (commit `39d737e`).** All eight prior findings (CI-001..008)
|
||||
were resolved by the batch of work between `9c60592` and `39d737e`: `ClusterOptions`
|
||||
gained XML docs, the `SectionName` constant, and the `DownIfAlone` property;
|
||||
`ClusterOptionsValidator` was added; `ServiceCollectionExtensions` now registers the
|
||||
validator and throws from the dead actor-registration method; and the test project
|
||||
grew to 16 cases across three test classes. The module is in good shape — the
|
||||
`ClusterOptions` contract, its validator, and the DI registration are all sound,
|
||||
well-documented, and well-tested. This re-review examined all three source files and
|
||||
all three test files against the full 10-category checklist and found **two new
|
||||
issues**, both stemming from work the prior review explicitly deferred to a "Host
|
||||
review" that has not happened: the `DownIfAlone` property is exposed and validated as
|
||||
part of the configuration contract but is never consumed — `ScadaLink.Host`'s
|
||||
`BuildHocon` still hard-codes `down-if-alone = on` (CI-009, Medium) — and the validator
|
||||
does not enforce the design doc's requirement that `down-if-alone` be `on` for the
|
||||
keep-oldest resolver, so `DownIfAlone = false` is silently accepted (CI-010, Low).
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
Original review (2026-05-16, `9c60592`) below; the re-review notes (2026-05-17,
|
||||
`39d737e`) are appended in each row.
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ✓ | No executable logic exists beyond an options POCO; no logic bugs, but `ServiceCollectionExtensions` returns success while doing nothing (CI-002). |
|
||||
| 2 | Akka.NET conventions | ✓ | No actors, no `ActorSystem` bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against `Tell`/`Ask`, immutability, or `PipeTo`. |
|
||||
| 3 | Concurrency & thread safety | ✓ | No shared mutable state, no actors, no async code. No issues found in current code. |
|
||||
| 4 | Error handling & resilience | ✓ | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. |
|
||||
| 5 | Security | ✓ | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. |
|
||||
| 6 | Performance & resource management | ✓ | No streams, connections, timers, or `IDisposable` resources exist yet. No issues found in current code. |
|
||||
| 7 | Design-document adherence | ✓ | Severe drift: the module implements none of its documented responsibilities (CI-001). `ClusterOptions` also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and `down-if-alone` (CI-003). |
|
||||
| 8 | Code organization & conventions | ✓ | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing `IValidateOptions`/data-annotation validation (CI-004) versus the Options pattern intent. |
|
||||
| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). |
|
||||
| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). |
|
||||
| 1 | Correctness & logic bugs | ✓ | No executable logic exists beyond an options POCO; no logic bugs, but `ServiceCollectionExtensions` returns success while doing nothing (CI-002). **Re-review:** CI-002 resolved. New — `DownIfAlone` is a settable property that controls nothing because the HOCON builder hard-codes the value (CI-009). |
|
||||
| 2 | Akka.NET conventions | ✓ | No actors, no `ActorSystem` bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against `Tell`/`Ask`, immutability, or `PipeTo`. **Re-review:** confirmed the Akka bootstrap legitimately lives in `ScadaLink.Host` (CI-001 resolution); still nothing actor-related in this module. No issues. |
|
||||
| 3 | Concurrency & thread safety | ✓ | No shared mutable state, no actors, no async code. No issues found in current code. **Re-review:** validator and DI extensions are stateless; no issues. |
|
||||
| 4 | Error handling & resilience | ✓ | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. **Re-review:** the validator now fails fast on misconfiguration. New — it does not enforce the design doc's `down-if-alone = on` requirement (CI-010). |
|
||||
| 5 | Security | ✓ | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. **Re-review:** still no authn/authz surface, no secret handling. No issues. |
|
||||
| 6 | Performance & resource management | ✓ | No streams, connections, timers, or `IDisposable` resources exist yet. No issues found in current code. **Re-review:** no resources held; the validator allocates a small failure list per call only. No issues. |
|
||||
| 7 | Design-document adherence | ✓ | Severe drift: the module implements none of its documented responsibilities (CI-001). `ClusterOptions` also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and `down-if-alone` (CI-003). **Re-review:** CI-001/CI-003 resolved (ownership split documented; `DownIfAlone` added). New — `DownIfAlone` was added to the contract but never wired into the HOCON (CI-009). |
|
||||
| 8 | Code organization & conventions | ✓ | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing `IValidateOptions`/data-annotation validation (CI-004) versus the Options pattern intent. **Re-review:** CI-004/CI-005 resolved; `SectionName` constant present and options/validator placement correct. No issues. |
|
||||
| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). **Re-review:** CI-006 resolved — 16 tests across three classes covering options, validator, and DI registration. No `DownIfAlone`-wiring test exists, but that wiring lives in the Host (CI-009). No new issue here. |
|
||||
| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). **Re-review:** CI-007/CI-008 resolved — full XML docs on all members; skeleton comments gone. Note: the `DownIfAlone` XML doc calls `true` "the design-doc requirement" yet the value is inert (CI-009) and unenforced (CI-010). |
|
||||
|
||||
## Findings
|
||||
|
||||
@@ -490,3 +509,97 @@ section, and the component table reflects the true placement. This is a
|
||||
documentation-only finding, so no runtime regression test is meaningful; verified by
|
||||
inspection of `ServiceCollectionExtensions.cs` and
|
||||
`docs/requirements/Component-ClusterInfrastructure.md:21-39`.
|
||||
|
||||
### ClusterInfrastructure-009 — `DownIfAlone` is an inert configuration knob — never consumed by the HOCON builder
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:74` |
|
||||
|
||||
**Description**
|
||||
|
||||
The `DownIfAlone` property was added to `ClusterOptions` by CI-003's resolution as
|
||||
part of "the split-brain configuration contract". It is public, defaults to `true`,
|
||||
carries an XML doc presenting it as "the design-doc requirement", and is exercised by
|
||||
`ClusterOptionsTests.DownIfAlone_CanBeSet`. However, nothing in the system reads it.
|
||||
The Akka.NET HOCON is generated by `ScadaLink.Host.Actors.AkkaHostedService.BuildHocon`,
|
||||
which **hard-codes** the resolver setting:
|
||||
|
||||
```
|
||||
split-brain-resolver {
|
||||
active-strategy = ...
|
||||
stable-after = ...
|
||||
keep-oldest {
|
||||
down-if-alone = on
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
`BuildHocon` receives the full `ClusterOptions` instance and consumes every other
|
||||
field (`SeedNodes`, `MinNrOfMembers`, `SplitBrainResolverStrategy`, `StableAfter`,
|
||||
`HeartbeatInterval`, `FailureDetectionThreshold`) but ignores `DownIfAlone` entirely.
|
||||
The result is a configuration property that an operator can set in `appsettings.json`,
|
||||
that passes validation, and that has **zero runtime effect** — setting
|
||||
`DownIfAlone: false` does not turn the flag off. CI-003's resolution explicitly
|
||||
acknowledged this gap ("wiring it to read `DownIfAlone` is a one-line `ScadaLink.Host`
|
||||
change ... noted for the Host's review") but the wiring was never done and no tracked
|
||||
finding carried it, so the gap has silently persisted to commit `39d737e`. An inert,
|
||||
misleadingly-documented configuration knob is a correctness and design-adherence
|
||||
defect: it gives operators a false sense of control over a safety-critical resolver
|
||||
setting.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Either (a) wire `DownIfAlone` into `BuildHocon` — emit `down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}`
|
||||
— so the property does what its XML doc claims (a Host-side change, to be tracked in
|
||||
the Host module's review since `BuildHocon` lives there), or (b) if the flag is
|
||||
intentionally fixed at `on` and must never be operator-configurable, remove the
|
||||
`DownIfAlone` property from `ClusterOptions` and document the hard-coded `on` value as
|
||||
a non-negotiable invariant. Do not leave a public, settable, validated property that
|
||||
controls nothing.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### ClusterInfrastructure-010 — Validator does not enforce `DownIfAlone = true` despite the design doc requiring it
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptionsValidator.cs:21-71` |
|
||||
|
||||
**Description**
|
||||
|
||||
`Component-ClusterInfrastructure.md` (Split-Brain Resolution) states the keep-oldest
|
||||
resolver must be configured with `down-if-alone = on`, and the XML doc on
|
||||
`ClusterOptions.DownIfAlone` calls `true` "the design-doc requirement" — the rationale
|
||||
being that without it the oldest node can run as an isolated single-node cluster
|
||||
during a partition while the younger node forms its own. `ClusterOptionsValidator`
|
||||
guards every other safety-critical setting (`MinNrOfMembers == 1`, `keep-oldest`-only
|
||||
strategy, positive timings, heartbeat below the failure threshold) but performs no
|
||||
check on `DownIfAlone`. A configuration of `DownIfAlone: false` therefore passes
|
||||
validation cleanly. This is currently latent because CI-009 shows the property is not
|
||||
consumed at all, but the moment CI-009 is fixed by wiring the property into the HOCON
|
||||
(option (a)), `DownIfAlone: false` would silently produce the unsafe single-node
|
||||
behaviour the design doc explicitly forbids — with no fail-fast guard. The validator
|
||||
is the right place to enforce the invariant, consistent with how it already rejects
|
||||
quorum split-brain strategies.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
If CI-009 is resolved by keeping `DownIfAlone` configurable, add a check to
|
||||
`ClusterOptionsValidator.Validate` that fails when `DownIfAlone` is `false` (or, if
|
||||
some future deployment legitimately needs it off, fails only in combination with the
|
||||
`keep-oldest` strategy), with a message explaining the isolated-single-node-cluster
|
||||
hazard. If CI-009 is resolved by removing the property, this finding is moot and
|
||||
should be closed as resolved alongside it.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.Commons` |
|
||||
| Design doc | `docs/requirements/Component-Commons.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 2 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -32,14 +32,28 @@ kind of edge-case logic that warrants them. Entity and message contracts otherwi
|
||||
clean and additive-evolution-friendly, with the exception of one `ValueTuple` use in a
|
||||
wire command.
|
||||
|
||||
**Re-review 2026-05-17 (commit `39d737e`).** All twelve prior findings (Commons-001
|
||||
through Commons-012) are confirmed `Resolved` — the fixes are sound, well-targeted, and
|
||||
backed by focused regression tests (`StaleTagMonitorRaceTests`, `DynamicJsonElementTests`,
|
||||
`ScriptParametersTests`, `ManagementCommandRegistryTests`, `OpcUaEndpointConfigSerializerTests`,
|
||||
`ResultTests`, `ValueFormatterTests`, `ConnectionBindingSerializationTests`,
|
||||
`FlatteningAndScriptScopeTests`). The new files introduced since `9c60592`
|
||||
(`TemplateAlarm` lock/inherit fields, `IExternalSystemRepository` name-keyed lookups,
|
||||
`DeploymentStateQueryRequest`/`Response`, `ParameterDefinition`) follow the established
|
||||
POCO / record / additive-evolution conventions and carry round-trip compatibility tests.
|
||||
Two new Low-severity findings were recorded this pass: a `DynamicJsonElement` array
|
||||
indexer that rejects `long` indices (Commons-013) and an `OpcUaEndpointConfigSerializer`
|
||||
legacy-fallback path that can mislabel a corrupt new-shape row as `Legacy` (Commons-014).
|
||||
No Critical, High, or Medium issues were found.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ✓ | `DynamicJsonElement.TryConvert` returns success for non-convertible types; `Result<T>` allows null error; legacy-config fallback loses data. |
|
||||
| 1 | Correctness & logic bugs | ✓ | `DynamicJsonElement.TryConvert` returns success for non-convertible types; `Result<T>` allows null error; legacy-config fallback loses data. Re-review: `DynamicJsonElement.TryGetIndex` rejects non-`int` indices (Commons-013). |
|
||||
| 2 | Akka.NET conventions | ✓ | Commons has no actors (correct). Message contracts are records and immutable. One wire message uses `ValueTuple` (Commons-008). Correlation IDs present on request/response messages. |
|
||||
| 3 | Concurrency & thread safety | ✓ | `StaleTagMonitor` has a check-then-act race between the timer callback and `OnValueReceived` (Commons-001). |
|
||||
| 4 | Error handling & resilience | ✓ | `ScriptParameters.GetNullable` silently swallows conversion failures (Commons-003); OPC UA legacy deserialize discards malformed input (Commons-005). |
|
||||
| 4 | Error handling & resilience | ✓ | `ScriptParameters.GetNullable` silently swallows conversion failures (Commons-003); OPC UA legacy deserialize discards malformed input (Commons-005). Re-review: corrupt typed OPC UA rows can fall through to the legacy path and be mislabelled `Legacy` (Commons-014). |
|
||||
| 5 | Security | ✓ | No auth logic here. `SmtpConfiguration.Credentials` / OPC UA passwords are plain-string fields (storage/encryption is a consumer concern) — noted, not a finding. No script-trust violations: Commons defines no forbidden-API surface. |
|
||||
| 6 | Performance & resource management | ✓ | `StaleTagMonitor` disposes its `Timer` correctly. `DynamicJsonElement` references a `JsonElement` whose backing document lifetime is not owned (Commons-002). |
|
||||
| 7 | Design-document adherence | ✓ | Several behavior-bearing helper/validator/serializer classes push against REQ-COM-6 "no business logic" (Commons-007). Folder layout matches REQ-COM-5b. |
|
||||
@@ -566,3 +580,75 @@ the parameterless `ToString()`). The XML doc gained a remarks block stating the
|
||||
culture-invariant contract and why. Regression tests added in `ValueFormatterTests`
|
||||
(`FormatDisplayValue_Double_UsesInvariantCulture_*`, `_DateTime_*`, `_CollectionOfDoubles_*`,
|
||||
each pinned under `de-DE`).
|
||||
|
||||
### Commons-013 — `DynamicJsonElement.TryGetIndex` rejects non-`int` index values
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Commons/Types/DynamicJsonElement.cs:40-54` |
|
||||
|
||||
**Description**
|
||||
|
||||
`TryGetIndex` accepts an index only when `indexes[0] is int index`. `DynamicJsonElement`
|
||||
is designed for dynamic access from scripts (`obj.items[0]`). In a `dynamic` expression the
|
||||
index operand's runtime type follows the script's variable type — a script that computes
|
||||
an index in a loop counter or reads it from another `DynamicJsonElement` (whose numbers
|
||||
are unwrapped as `long` by `Wrap`, see `:105`) will pass a `long`, not an `int`. The
|
||||
pattern match then fails, `TryGetIndex` returns `false`, and the dynamic binder throws a
|
||||
`RuntimeBinderException` for what is a perfectly valid in-range index. Because the wrapper
|
||||
itself surfaces JSON numbers as `long`, `obj.items[obj.count - 1]` — count being a wrapped
|
||||
JSON number — is the exact failing case. The `int`-only guard also silently rejects
|
||||
`byte`/`short` indices that would widen to a valid array position.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Accept any integral index by converting through `Convert.ToInt64` (guarded for
|
||||
`OverflowException`) or by matching `int`, `long`, `short`, `byte` and normalizing to a
|
||||
single integer before the bounds check. Add a regression test indexing with a `long`.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Commons-014 — `OpcUaEndpointConfigSerializer.Deserialize` can mislabel a corrupt typed row as `Legacy`
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Commons/Serialization/OpcUaEndpointConfigSerializer.cs:107-131` |
|
||||
|
||||
**Description**
|
||||
|
||||
`Deserialize` tries the typed path first: it parses the document, checks for an
|
||||
`endpointUrl` property, then calls `JsonSerializer.Deserialize<OpcUaEndpointConfig>`.
|
||||
The whole block is wrapped in `catch (JsonException) { /* fall through to legacy */ }`.
|
||||
If a row *is* the current typed shape (it has `endpointUrl`) but is corrupt in a way that
|
||||
makes `JsonSerializer.Deserialize` throw a `JsonException` — e.g. an enum-valued field
|
||||
holding an unrecognised string, or a numeric field holding a non-numeric token — the
|
||||
exception is swallowed and control falls through to `LoadLegacy`. `LoadLegacy` only
|
||||
requires the root to be a JSON object, so it will usually succeed against the same input
|
||||
and the result is reported as `OpcUaConfigParseStatus.Legacy`. The Commons-005 fix added
|
||||
the `Malformed` status precisely so a caller can tell a recoverable legacy row from
|
||||
unparseable data; this path re-introduces a softer version of the same confusion — a
|
||||
genuinely broken current-shape row is presented to the user as a benign "please re-save"
|
||||
legacy row, and the offending field is silently dropped by `FromFlatDict` (which ignores
|
||||
keys it cannot parse) rather than surfaced. The XML doc describes the legacy fallback as
|
||||
being for "pre-refactor rows" only and does not mention this branch.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Only fall through to `LoadLegacy` when the typed shape is genuinely *not present* — i.e.
|
||||
the `endpointUrl` property is absent. When `endpointUrl` *is* present but typed
|
||||
deserialization throws, classify the outcome as `Malformed` (or a distinct status) so the
|
||||
caller can surface a real error instead of an empty/partial config. Tighten the XML doc
|
||||
to describe this branch, and add a regression test for a typed row with an invalid enum
|
||||
field.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -5,10 +5,10 @@
|
||||
| Module | `src/ScadaLink.Communication` |
|
||||
| Design doc | `docs/requirements/Component-Communication.md` |
|
||||
| Status | Reviewed |
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 0 |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 4 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -25,20 +25,37 @@ CLAUDE.md "Resume for coordinator actors" decision. Design-doc adherence is othe
|
||||
good. Test coverage is broad for happy paths but has gaps around failover, cache
|
||||
mutation races, and the snapshot-timeout cleanup path.
|
||||
|
||||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||
|
||||
All prior findings (Communication-001..011) are confirmed `Resolved` in this commit
|
||||
and the fixes hold up against the source. The re-review walked all 10 checklist
|
||||
categories again and uncovered a previously-missed defect at the centre of the gRPC
|
||||
node-failover path: **`SiteStreamGrpcClientFactory.GetOrCreate` caches one client per
|
||||
site identifier and silently ignores the `grpcEndpoint` argument on a cache hit**. The
|
||||
`DebugStreamBridgeActor` reconnect logic flips `_useNodeA` and passes the *other*
|
||||
node's endpoint, but the factory hands back the original NodeA-bound client every
|
||||
time — so the documented "try the other site node endpoint" failover never actually
|
||||
moves to NodeB (Communication-012). The same caching defect means a site address
|
||||
change is never picked up because `RemoveSiteAsync` has no production caller
|
||||
(Communication-013). Two Low findings round out the re-review: an untrusted
|
||||
gRPC-supplied `correlation_id` flows straight into an Akka actor name
|
||||
(Communication-014), and the factory's endpoint-reuse defect is masked by the test
|
||||
mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low.
|
||||
|
||||
## Checklist coverage
|
||||
|
||||
| # | Category | Examined | Notes |
|
||||
|---|----------|----------|-------|
|
||||
| 1 | Correctness & logic bugs | ✓ | Snapshot-timeout orphan, reconnect not calling `CleanupGrpc`, subscription-map races. |
|
||||
| 2 | Akka.NET conventions | ✓ | No supervision strategy on coordinators; `Sender` captured in async-launched closure path. |
|
||||
| 3 | Concurrency & thread safety | ✓ | `SiteStreamGrpcClient._subscriptions` overwrite/remove race; `_siteClients` field reassignment unused but non-readonly. |
|
||||
| 4 | Error handling & resilience | ✓ | gRPC reconnect leaks server-side relay; `LoadSiteAddressesFromDb` swallows DB failures silently. |
|
||||
| 5 | Security | ✓ | No findings in module code. DebugStreamHub auth lives outside this module (Central UI). |
|
||||
| 6 | Performance & resource management | ✓ | Orphaned subscriptions/CTS leaks; `SiteStreamGrpcClientFactory.Dispose` blocks on async. |
|
||||
| 7 | Design-document adherence | ✓ | `GrpcMaxStreamLifetime` / keepalive options defined but never applied; hard-coded values used instead. |
|
||||
| 8 | Code organization & conventions | ✓ | Options pattern correct; minor: public records declared in actor files. No structural issues. |
|
||||
| 9 | Testing coverage | ✓ | No tests for snapshot-timeout cleanup, address-cache refresh races, or gRPC server reconnect-leak. |
|
||||
| 10 | Documentation & comments | ✓ | XML comment on `DebugStreamBridgeActor` says "Persistent actor" — it is not an Akka.Persistence actor. |
|
||||
| 1 | Correctness & logic bugs | ✓ | Re-review: factory ignores endpoint on cache hit, defeating NodeA→NodeB stream failover (Communication-012). Prior items resolved. |
|
||||
| 2 | Akka.NET conventions | ✓ | Coordinator `Resume` strategies now present and verified. No new issues. |
|
||||
| 3 | Concurrency & thread safety | ✓ | Subscription-map register/remove now ownership-checked. `_siteClients` readonly. No new issues. |
|
||||
| 4 | Error handling & resilience | ✓ | `Status.Failure` handler added; reconnect unsubscribes prior stream. No new issues. |
|
||||
| 5 | Security | ✓ | Re-review: public gRPC `correlation_id` flows unvalidated into an Akka actor name (Communication-014). |
|
||||
| 6 | Performance & resource management | ✓ | Synchronous `Dispose` paths fixed; CTS leaks resolved. No new issues. |
|
||||
| 7 | Design-document adherence | ✓ | Re-review: site gRPC address-change disposal not wired — `RemoveSiteAsync` is dead code (Communication-013). gRPC options now applied. |
|
||||
| 8 | Code organization & conventions | ✓ | Options pattern correct; public records still declared in actor files (acceptable). No structural issues. |
|
||||
| 9 | Testing coverage | ✓ | Re-review: prior gaps closed, but the factory mock masks the endpoint-reuse defect — no real node-flip coverage (Communication-015). |
|
||||
| 10 | Documentation & comments | ✓ | `DebugStreamBridgeActor` summary corrected. No new issues. |
|
||||
|
||||
## Findings
|
||||
|
||||
@@ -519,3 +536,151 @@ passes after):
|
||||
(added with this finding's resolution).
|
||||
The full module suite (`dotnet test tests/ScadaLink.Communication.Tests`) is green at
|
||||
111 passing tests.
|
||||
|
||||
### Communication-012 — gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | High |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:39`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:166` |
|
||||
|
||||
**Description**
|
||||
|
||||
`SiteStreamGrpcClientFactory.GetOrCreate` is `_clients.GetOrAdd(siteIdentifier, _ =>
|
||||
CreateClient(grpcEndpoint))` — it keys the cache by **site identifier only** and the
|
||||
`grpcEndpoint` argument is used *exclusively* for the first-ever creation. Every
|
||||
subsequent call for that site returns the originally-cached `SiteStreamGrpcClient`,
|
||||
which is permanently bound to the `GrpcChannel` of whatever endpoint was passed first.
|
||||
|
||||
`DebugStreamBridgeActor` relies on the opposite behaviour. On a gRPC stream error,
|
||||
`HandleGrpcError` flips `_useNodeA` and `OpenGrpcStream` recomputes
|
||||
`endpoint = _useNodeA ? _grpcNodeAAddress : _grpcNodeBAddress`, then calls
|
||||
`_grpcFactory.GetOrCreate(_siteIdentifier, endpoint)` expecting a client connected to
|
||||
the *other* node. Because the factory ignores the new endpoint, the bridge actor
|
||||
reconnects to the **same failed NodeA endpoint** on every retry. The design doc's
|
||||
core debug-stream failover behaviour ("tries the other site node endpoint", "NodeB if
|
||||
NodeA failed, or vice versa") is therefore inoperative — when a site node goes down,
|
||||
the debug stream cannot move to the surviving node and simply exhausts `MaxRetries`
|
||||
against the dead endpoint and terminates. The `_useNodeA` flip, the `previousEndpoint`
|
||||
computation in `HandleGrpcError`, and the `CleanupGrpc` endpoint selection are all
|
||||
dead logic. (Communication-002's `Unsubscribe`-before-reconnect fix still functions,
|
||||
but it unsubscribes and re-subscribes on the *same* client/node rather than the
|
||||
intended other node.)
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Make the per-site client aware of both endpoints, or key the cache by
|
||||
`(siteIdentifier, endpoint)`, or have `GetOrCreate` detect an endpoint change and
|
||||
dispose+recreate the cached client. Given the design intent ("Falls back to NodeB if
|
||||
NodeA connection fails"), the cleanest fix is to give `SiteStreamGrpcClient` (or a
|
||||
per-site holder) both NodeA/NodeB addresses and let it switch channels internally,
|
||||
removing the endpoint argument from `GetOrCreate` entirely. Add a test that drives a
|
||||
real `SiteStreamGrpcClientFactory` through a node flip and asserts the second client
|
||||
targets the other endpoint.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Communication-013 — Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:58` |
|
||||
|
||||
**Description**
|
||||
|
||||
The design doc states that `SiteStreamGrpcClientFactory` "Disposes clients on site
|
||||
removal or address change." `RemoveSiteAsync` implements the disposal mechanism, but
|
||||
a repo-wide search finds **no production caller** — only tests invoke it. Combined
|
||||
with the cache-by-site-identifier behaviour (Communication-012), the consequence is
|
||||
that once a site's `SiteStreamGrpcClient` is created, a later edit to that site's
|
||||
`GrpcNodeAAddress` / `GrpcNodeBAddress` (via the Central UI or CLI) is never reflected
|
||||
in the cached client — it keeps using the stale channel for the life of the process.
|
||||
`CentralCommunicationActor` already refreshes the *Akka* address cache every 60s and
|
||||
recreates ClusterClients on change, but there is no equivalent invalidation path
|
||||
wired into the gRPC client factory. A site whose gRPC endpoints are corrected after
|
||||
an initial misconfiguration will never have working debug streaming until the central
|
||||
node is restarted.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Wire a site-removal / address-change signal into `SiteStreamGrpcClientFactory` —
|
||||
e.g. have `CentralCommunicationActor` (which already detects address changes in
|
||||
`HandleSiteAddressCacheLoaded`) call `RemoveSiteAsync` for sites whose gRPC addresses
|
||||
changed or were removed, or fold the gRPC endpoints into the same refresh cycle. If
|
||||
the on-the-fly address-change requirement is intentionally dropped, remove
|
||||
`RemoveSiteAsync` and correct the design doc.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Communication-014 — Untrusted gRPC `correlation_id` flows directly into an Akka actor name
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Security |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:124` |
|
||||
|
||||
**Description**
|
||||
|
||||
`SubscribeInstance` is a public gRPC endpoint hosted on each site node. It creates the
|
||||
relay actor with `$"stream-relay-{request.CorrelationId}-{actorSeq}"` as the actor
|
||||
name, where `request.CorrelationId` comes straight off the wire. Akka actor names have
|
||||
a restricted character set; a `correlation_id` containing `/`, whitespace, or other
|
||||
disallowed characters makes `ActorSystem.ActorOf` throw `InvalidActorNameException`.
|
||||
That exception is not caught inside `SubscribeInstance`, so it escapes as an unhandled
|
||||
RPC fault (and after the `_streamSubscriber.Subscribe` / `_activeStreams` entry has
|
||||
already been set up for the duration, though the `finally` does not run because the
|
||||
throw is before the `try`). In practice central always supplies a GUID, so impact is
|
||||
low, but the server is trusting client-supplied input to be actor-name-safe.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Validate `request.CorrelationId` on entry (non-empty, matches an expected GUID/safe
|
||||
pattern) and reject with `StatusCode.InvalidArgument` otherwise; or derive the actor
|
||||
name solely from the internal `_actorCounter` and keep the correlation ID only as
|
||||
actor state / dictionary key.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
### Communication-015 — No test exercises the real gRPC client factory across a node flip
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Testing coverage |
|
||||
| Status | Open |
|
||||
| Location | `tests/ScadaLink.Communication.Tests/Grpc/DebugStreamBridgeActorTests.cs:401`, `tests/ScadaLink.Communication.Tests/Grpc/SiteStreamGrpcClientFactoryTests.cs` |
|
||||
|
||||
**Description**
|
||||
|
||||
`DebugStreamBridgeActorTests` exercises the reconnect/failover paths through
|
||||
`MockSiteStreamGrpcClientFactory`, which returns one fixed mock client regardless of
|
||||
the `grpcEndpoint` argument. This is exactly the behaviour the *real*
|
||||
`SiteStreamGrpcClientFactory` exhibits incorrectly (Communication-012), so the mock
|
||||
masks the defect: `On_GrpcError_Reconnects_To_Other_Node` passes even though the real
|
||||
factory never reaches the other node. `SiteStreamGrpcClientFactoryTests` only asserts
|
||||
`GetOrCreate` returns the same client for the same site — it never checks what happens
|
||||
when the same site is requested with a *different* endpoint.
|
||||
|
||||
**Recommendation**
|
||||
|
||||
Add a `SiteStreamGrpcClientFactoryTests` case that calls `GetOrCreate(site, endpointA)`
|
||||
then `GetOrCreate(site, endpointB)` and asserts the second call targets `endpointB`
|
||||
(it should fail today and pass after Communication-012 is fixed). Have the bridge-actor
|
||||
test's mock factory track the endpoint per call so node-flip coverage is meaningful.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
|
||||
@@ -40,20 +40,20 @@ module file and counted in **Total**.
|
||||
| Severity | Open findings |
|
||||
|----------|---------------|
|
||||
| Critical | 0 |
|
||||
| High | 0 |
|
||||
| Medium | 0 |
|
||||
| Low | 0 |
|
||||
| **Total** | **0** |
|
||||
| High | 2 |
|
||||
| Medium | 5 |
|
||||
| Low | 10 |
|
||||
| **Total** | **17** |
|
||||
|
||||
## Module Status
|
||||
|
||||
| Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total |
|
||||
|--------|---------------|--------|----------------|------|-------|
|
||||
| [CLI](CLI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
|
||||
| [CentralUI](CentralUI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 19 |
|
||||
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 8 |
|
||||
| [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 12 |
|
||||
| [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
|
||||
| [CLI](CLI/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/2 | 3 | 16 |
|
||||
| [CentralUI](CentralUI/findings.md) | 2026-05-16 | `9c60592` | 0/1/2/3 | 6 | 25 |
|
||||
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/1 | 2 | 10 |
|
||||
| [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/2 | 2 | 14 |
|
||||
| [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/2 | 4 | 15 |
|
||||
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
|
||||
| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
|
||||
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
|
||||
@@ -80,14 +80,34 @@ description, location, recommendation — lives in the module's `findings.md`.
|
||||
|
||||
_None open._
|
||||
|
||||
### High (0)
|
||||
### High (2)
|
||||
|
||||
_None open._
|
||||
| ID | Module | Title |
|
||||
|----|--------|-------|
|
||||
| CentralUI-020 | [CentralUI](CentralUI/findings.md) | Idle-session redirect never fires: `SessionExpiry` polls a frozen auth-state snapshot |
|
||||
| Communication-012 | [Communication](Communication/findings.md) | gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover |
|
||||
|
||||
### Medium (0)
|
||||
### Medium (5)
|
||||
|
||||
_None open._
|
||||
| ID | Module | Title |
|
||||
|----|--------|-------|
|
||||
| CLI-014 | [CLI](CLI/findings.md) | `update` commands require "core" fields, making partial updates impossible |
|
||||
| CentralUI-021 | [CentralUI](CentralUI/findings.md) | `DebugView` stream callback mutates `Dictionary` off the render thread |
|
||||
| CentralUI-022 | [CentralUI](CentralUI/findings.md) | `Deployments` push handler fires `InvokeAsync` with no disposal guard |
|
||||
| ClusterInfrastructure-009 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | `DownIfAlone` is an inert configuration knob — never consumed by the HOCON builder |
|
||||
| Communication-013 | [Communication](Communication/findings.md) | Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller |
|
||||
|
||||
### Low (0)
|
||||
### Low (10)
|
||||
|
||||
_None open._
|
||||
| ID | Module | Title |
|
||||
|----|--------|-------|
|
||||
| CLI-015 | [CLI](CLI/findings.md) | `Component-CLI.md` command surface has drifted again in two places |
|
||||
| CLI-016 | [CLI](CLI/findings.md) | `WriteAsTable` derives columns from the first array element only |
|
||||
| CentralUI-023 | [CentralUI](CentralUI/findings.md) | Residual bare `catch {}` blocks swallow JS interop errors |
|
||||
| CentralUI-024 | [CentralUI](CentralUI/findings.md) | Claim lookups use magic strings instead of `JwtTokenService` constants |
|
||||
| CentralUI-025 | [CentralUI](CentralUI/findings.md) | `SessionExpiry` polling/redirect path has no test coverage |
|
||||
| ClusterInfrastructure-010 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Validator does not enforce `DownIfAlone = true` despite the design doc requiring it |
|
||||
| Commons-013 | [Commons](Commons/findings.md) | `DynamicJsonElement.TryGetIndex` rejects non-`int` index values |
|
||||
| Commons-014 | [Commons](Commons/findings.md) | `OpcUaEndpointConfigSerializer.Deserialize` can mislabel a corrupt typed row as `Legacy` |
|
||||
| Communication-014 | [Communication](Communication/findings.md) | Untrusted gRPC `correlation_id` flows directly into an Akka actor name |
|
||||
| Communication-015 | [Communication](Communication/findings.md) | No test exercises the real gRPC client factory across a node flip |
|
||||
|
||||
Reference in New Issue
Block a user