From e49846603e364c2fa8a5d96ce03df0ebf546aa1e Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Sun, 17 May 2026 00:41:21 -0400 Subject: [PATCH] =?UTF-8?q?docs(code-reviews):=20re-review=20batch=201=20a?= =?UTF-8?q?t=2039d737e=20=E2=80=94=20CentralUI,=20CLI,=20ClusterInfrastruc?= =?UTF-8?q?ture,=20Commons,=20Communication?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 17 new findings: CentralUI-020..025, CLI-014..016, ClusterInfrastructure-009..010, Commons-013..014, Communication-012..015. --- code-reviews/CLI/findings.md | 157 ++++++++++- code-reviews/CentralUI/findings.md | 261 +++++++++++++++++- .../ClusterInfrastructure/findings.md | 139 +++++++++- code-reviews/Commons/findings.md | 96 ++++++- code-reviews/Communication/findings.md | 191 ++++++++++++- code-reviews/README.md | 50 +++- 6 files changed, 842 insertions(+), 52 deletions(-) diff --git a/code-reviews/CLI/findings.md b/code-reviews/CLI/findings.md index f11efce..3ad0039 100644 --- a/code-reviews/CLI/findings.md +++ b/code-reviews/CLI/findings.md @@ -5,10 +5,10 @@ | Module | `src/ScadaLink.CLI` | | Design doc | `docs/requirements/Component-CLI.md` | | Status | Reviewed | -| Last reviewed | 2026-05-16 | +| Last reviewed | 2026-05-17 | | Reviewer | claude-agent | -| Commit reviewed | `9c60592` | -| Open findings | 0 | +| Commit reviewed | `39d737e` | +| Open findings | 3 | ## Summary @@ -31,8 +31,26 @@ ID-keyed, flag-based surface. Test coverage exercises `OutputFormatter`, `CliCon `CommandHelpers.HandleResponse`, but the HTTP client, the `debug stream` path, the JSON argument parsing, and the command-tree wiring are untested. +#### Re-review 2026-05-17 (commit `39d737e`) + +All 13 prior findings are confirmed resolved — the resilience gaps, the dead format +configuration, the credential-handling weakness, and the test-coverage holes have all +been closed, and the test suite has grown substantially (`CommandTreeTests`, +`ManagementHttpClientTests`, `DebugStreamTests`, etc.). The CLI's runtime behaviour is now +solid. This re-review walked all 14 command groups against the full checklist and found +three new issues, all rooted in **update-command and design-document drift** rather than +runtime defects: every `update` command requires the entity's "core" fields (`--name`, +`--script`) even though the design doc presents them as optional, so a partial update is +impossible (CLI-014); the design doc's command surface has drifted again in two specific +places — `template composition delete` and the `data-connection` config flags (CLI-015); +and `WriteAsTable` derives table columns from only the first array element, silently +dropping columns for any later element with a different shape (CLI-016). No +Critical/High issues; the module remains healthy. + ## Checklist coverage +_Original review (2026-05-16, `9c60592`):_ + | # | Category | Examined | Notes | |---|----------|----------|-------| | 1 | Correctness & logic bugs | ☑ | Format precedence is broken (CLI-001); empty/non-JSON success bodies crash table rendering (CLI-002, CLI-003). | @@ -46,6 +64,21 @@ argument parsing, and the command-tree wiring are untested. | 9 | Testing coverage | ☑ | No tests for `ManagementHttpClient`, `DebugCommands`, command-tree wiring, or JSON argument parsing (CLI-013). | | 10 | Documentation & comments | ☑ | `Component-CLI.md` mismatch (CLI-007); the in-repo `README.md` is reasonably accurate. Minor exit-code doc mismatch (CLI-009). | +_Re-review (2026-05-17, `39d737e`):_ + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | Update commands require "core" fields, blocking partial updates (CLI-014); `WriteAsTable` headers derived from first array element only (CLI-016). | +| 2 | Akka.NET conventions | ☑ | Not applicable — pure HTTP/SignalR client. No issues. | +| 3 | Concurrency & thread safety | ☑ | `debug stream` concurrency now resolved via `DebugStreamHelpers` (CLI-011/012). No new issues. | +| 4 | Error handling & resilience | ☑ | Malformed-URL / malformed-JSON / connect-cancellation paths all hardened (CLI-004/005/010). No new issues. | +| 5 | Security | ☑ | Env-var credential fallback in place (CLI-006). Basic Auth over HTTP is by design. No new issues. | +| 6 | Performance & resource management | ☑ | `CancellationTokenSource` now `using`-scoped (CLI-011). No new issues. | +| 7 | Design-document adherence | ☑ | Two residual command-surface drifts: `template composition delete` and `data-connection --primary-config` (CLI-015). | +| 8 | Code organization & conventions | ☑ | Consistent and clean; option construction centralised in `CliOptions`. No new issues. | +| 9 | Testing coverage | ☑ | Substantially expanded (`CommandTreeTests`, `ManagementHttpClientTests`, `DebugStreamTests`). No new gaps. | +| 10 | Documentation & comments | ☑ | XML docs accurate. `Component-CLI.md` drift folded into CLI-015. | + ## Findings ### CLI-001 — `SCADALINK_FORMAT` env var and config-file format are dead; format precedence broken @@ -537,3 +570,121 @@ Resolved 2026-05-16 (commit pending). Coverage gaps confirmed and closed: `InstanceCommands.TryParseBindings`/`TryParseOverrides` and covered by `InstanceArgumentParsingTests` under CLI-005. The CLI test suite went from 42 to 77 passing tests. + +### CLI-014 — `update` commands require "core" fields, making partial updates impossible + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.CLI/Commands/TemplateCommands.cs:77`, `src/ScadaLink.CLI/Commands/SiteCommands.cs:86`, `src/ScadaLink.CLI/Commands/ExternalSystemCommands.cs:40-42`, `src/ScadaLink.CLI/Commands/DataConnectionCommands.cs:39-40`, `src/ScadaLink.CLI/Commands/NotificationCommands.cs:40-41`, `src/ScadaLink.CLI/Commands/ApiMethodCommands.cs:79` | + +**Description** + +The design doc presents `update` commands with all non-`--id` fields as optional, e.g. +`template update --id [--name ] [--description ] [--parent-id ]` +(`Component-CLI.md:62`) and `api-method update --id [--script ] ...` +(`Component-CLI.md:224`). The implementation contradicts this: the update commands mark +the entity's "core" fields as `Required = true`, so the user must always re-supply them: + +- `template update` — `--name` is `Required = true` (`TemplateCommands.cs:77`). +- `site update` — `--name` is `Required = true` (`SiteCommands.cs:86`). +- `external-system update` — `--name`, `--endpoint-url`, `--auth-type` are all + `Required = true` (`ExternalSystemCommands.cs:40-42`). +- `data-connection update` — `--name`, `--protocol` are `Required = true` + (`DataConnectionCommands.cs:39-40`). +- `notification update` — `--name`, `--emails` are `Required = true` + (`NotificationCommands.cs:40-41`). +- `api-method update` — `--script` is `Required = true` (`ApiMethodCommands.cs:79`). +- The same pattern applies to `template attribute/alarm/script update` + (`TemplateCommands.cs:164-165`, `246-248`, `332-334`) and `role-mapping update` + (`SecurityCommands.cs:110-111`). + +Because the corresponding `Update*Command` records are whole-replace (they carry the full +field set, not a sparse patch), a user who wants to change only one field — e.g. flip an +API method's timeout, or rename a template — must look up and re-pass every other field's +current value. Omitting any required flag is a hard parse error. This makes scripted, +single-field updates (a core CLI/CI use case) awkward and error-prone, and it does not +match the documented optional-flag surface. + +**Recommendation** + +Decide on one model and align doc + code. Either (a) make the update flags genuinely +optional and have the server/`Update*Command` treat a null field as "leave unchanged" +(sparse patch), or (b) if whole-replace is intentional, update `Component-CLI.md` to show +these flags as required (no `[...]`) and document that an update replaces the whole +entity. Option (a) matches the documented surface and the typical CLI expectation. + +**Resolution** + +_Unresolved._ + +### CLI-015 — `Component-CLI.md` command surface has drifted again in two places + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Open | +| Location | `docs/requirements/Component-CLI.md:75`, `docs/requirements/Component-CLI.md:125-126` (vs. `src/ScadaLink.CLI/Commands/TemplateCommands.cs:404-413`, `src/ScadaLink.CLI/Commands/DataConnectionCommands.cs:41`, `:86`) | + +**Description** + +CLI-007 regenerated the doc's "Command Structure" section, but two specific drifts remain +or were introduced: + +- `Component-CLI.md:75` documents `template composition delete --template-id + --instance-name `, but the implementation (`TemplateCommands.cs:404-413`) deletes + a composition by its own integer ID via a single `--id` option + (`DeleteTemplateCompositionCommand(id)`). The doc's two-flag form does not exist. +- `data-connection create` and `update` accept a `--primary-config` option (aliased + `--configuration`) for the primary configuration JSON (`DataConnectionCommands.cs:86`, + `:41`), but `Component-CLI.md:125-126` lists only `--backup-config` and + `--failover-retry-count` — the primary-config flag is absent from the doc. + +A reader following the doc would use a non-existent `template composition delete` form and +would not discover the `--primary-config` flag. + +**Recommendation** + +Correct `Component-CLI.md:75` to `template composition delete --id `, and add +`[--primary-config ]` to the documented `data-connection create`/`update` signatures +(`Component-CLI.md:125-126`). Also note the `--configuration` alias if aliases are +documented elsewhere. + +**Resolution** + +_Unresolved._ + +### CLI-016 — `WriteAsTable` derives columns from the first array element only + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.CLI/Commands/CommandHelpers.cs:184-200` | + +**Description** + +When rendering a JSON array as a table, `WriteAsTable` builds the header set from +`items[0].EnumerateObject()` only (`CommandHelpers.cs:184-186`) and then projects every +row against that fixed header list (`:188-198`). If a later element of the array has a +different shape — additional properties, or properties the first element lacks — those +extra columns are silently dropped from the table and a row missing a header property +renders an empty cell. The user sees a table that appears complete but has omitted data, +with no indication that columns were discarded. (The JSON output path is unaffected; this +only affects `--format table`.) Management API list responses are generally homogeneous, +so the practical impact is low, but a heterogeneous array — e.g. a diff or a mixed-status +list — would be rendered incorrectly with no warning. + +**Recommendation** + +Compute the header set as the union of property names across all array elements (iterate +all items, collect distinct property names preserving first-seen order) before projecting +rows, so no element's data is silently dropped. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/CentralUI/findings.md b/code-reviews/CentralUI/findings.md index c529247..62b2914 100644 --- a/code-reviews/CentralUI/findings.md +++ b/code-reviews/CentralUI/findings.md @@ -5,10 +5,10 @@ | Module | `src/ScadaLink.CentralUI` | | Design doc | `docs/requirements/Component-CentralUI.md` | | Status | Reviewed | -| Last reviewed | 2026-05-16 | +| Last reviewed | 2026-05-17 | | Reviewer | claude-agent | -| Commit reviewed | `9c60592` | -| Open findings | 0 | +| Commit reviewed | `39d737e` | +| Open findings | 6 | ## Summary @@ -32,6 +32,24 @@ Testing coverage is thin for a module this large: only the script analyzer, TreeView, schema model, and a few data-connection pages have unit tests; most pages and the auth bridge are untested. +#### Re-review 2026-05-17 (commit `39d737e`) + +All 19 findings from the 2026-05-16 review are confirmed closed. The resolution +batch (`a9bd7ee`..`34588ae`) substantially rewrote the auth bridge, the script +sandbox, several Deployment/Monitoring pages, and the shared component disposal +paths, so this re-review re-examined the post-fix state across all 10 checklist +categories. Six new findings (CentralUI-020 .. 025) were recorded. The most +important is **CentralUI-020**: the two prior fixes interact destructively — the +CentralUI-004 fix made `CookieAuthenticationStateProvider` return a frozen, +constructor-time auth-state snapshot, while the CentralUI-005 fix rewrote +`SessionExpiry.razor` to *poll* that same provider to detect a lapsed session. +Because the snapshot never changes for the life of the circuit, the idle-timeout +redirect can never fire, so the documented idle-logout behaviour is silently +defeated. The remaining new findings are a cross-thread `Dictionary` mutation in +`DebugView`, an unguarded `InvokeAsync` in the new `Deployments` push handler, +and three Low-severity items (residual bare `catch`, magic-string claim +lookups, and the untested `SessionExpiry` polling path). + ## Checklist coverage | # | Category | Examined | Notes | @@ -47,6 +65,14 @@ pages and the auth bridge are untested. | 9 | Testing coverage | ☑ | Auth, sandbox-run, DebugView, Health, ParkedMessages, most pages untested. | | 10 | Documentation & comments | ☑ | Comments are accurate and helpful; a few stale claims noted. | +Re-review 2026-05-17 (`39d737e`): all 10 categories re-examined against the +post-fix source. New findings — category 3 (CentralUI-020 the auth-snapshot vs +session-poll interaction is also a design-adherence regression; CentralUI-021 +cross-thread `Dictionary`; CentralUI-022 unguarded `InvokeAsync`), category 4 +(CentralUI-023 residual bare `catch`), category 8 (CentralUI-024 magic-string +claims), category 9 (CentralUI-025 untested `SessionExpiry` poll). Categories +1, 2, 5, 6, 7, 10 produced no new findings. + ## Findings ### CentralUI-001 — Test Run sandbox executes arbitrary C# with no trust-model enforcement @@ -961,3 +987,232 @@ it was failing on the baseline because `DeploymentService` had gained a `DiffService` constructor dependency from a DeploymentManager contract change that the test fixture had not been updated for; `DiffService` is now registered in the fixture.) + +### CentralUI-020 — Idle-session redirect never fires: `SessionExpiry` polls a frozen auth-state snapshot + +| | | +|--|--| +| Severity | High | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Components/Shared/SessionExpiry.razor:39-62`; `src/ScadaLink.CentralUI/Auth/CookieAuthenticationStateProvider.cs:29-43` | + +**Description** + +The CentralUI-004 fix and the CentralUI-005 fix interact destructively. +CentralUI-004 made `CookieAuthenticationStateProvider` snapshot the principal +**once** in its constructor into a cached `Task` and serve +that exact task for the entire life of the SignalR circuit — it never re-reads +`HttpContext`, never calls `SetAuthenticationState`, and never raises +`NotifyAuthenticationStateChanged`. CentralUI-005 then rewrote +`SessionExpiry.razor` to *poll* `AuthStateProvider.GetAuthenticationStateAsync()` +once a minute and redirect to `/login` "once the sliding cookie has actually +lapsed server-side." But `GetAuthenticationStateAsync()` returns the same frozen +constructor-time snapshot on every call — `auth.User.Identity.IsAuthenticated` +is permanently `true` for the life of the circuit regardless of whether the +server-side cookie has expired. The poll loop therefore never observes an +expired session and the redirect never fires. An idle user whose cookie has +lapsed server-side keeps an authenticated-looking page open indefinitely; the +documented "30-minute idle timeout" is silently defeated for any user who +leaves a circuit open. (The cookie middleware would still reject the *next* +full HTTP request / new circuit, so this is a stale-UI / missed-logout exposure +rather than a full auth bypass — but the page continues to render +authenticated content and a SignalR circuit can stay alive for a long time.) +This is also a design-document-adherence regression against CLAUDE.md +"Security & Auth" (30-minute idle timeout) — recorded under Concurrency because +the root cause is the lifetime/staleness mismatch between the two components. + +**Recommendation** + +`SessionExpiry` must consult something that actually reflects the live +server-side session, not the circuit's frozen principal. Options: (a) have +`SessionExpiry` poll a lightweight authenticated server endpoint (e.g. a +`/auth/ping` minimal API that returns 401 once the cookie has lapsed) and +redirect on 401; or (b) give `CookieAuthenticationStateProvider` a refresh path +that re-validates the cookie and calls `SetAuthenticationState` / +`NotifyAuthenticationStateChanged` so the polled state can actually change. +Whichever is chosen, add a test that exercises the redirect path with an +expired session (see CentralUI-025). + +**Resolution** + +_Unresolved._ + +### CentralUI-021 — `DebugView` stream callback mutates `Dictionary` off the render thread + +| | | +|--|--| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Components/Pages/Deployment/DebugView.razor:404-419,511-519,275-289` | + +**Description** + +The `onEvent` callback passed to `DebugStreamService.StartStreamAsync` runs on +an Akka/gRPC thread (as the design doc and the CentralUI-009 comments state). It +calls `UpsertWithCap(_attributeValues, …)` / `UpsertWithCap(_alarmStates, …)` +**directly on that thread** — the mutation is not marshalled through +`InvokeAsync`; only the subsequent `StateHasChanged` is. Meanwhile the render +thread evaluates `FilteredAttributeValues` / `FilteredAlarmStates`, which +enumerate `_attributeValues.Values` / `_alarmStates.Values` and call +`OrderBy(...).ToList()`. `Dictionary` is not thread-safe: a write +on the Akka thread concurrent with an enumeration on the render thread can throw +`InvalidOperationException` ("Collection was modified; enumeration operation may +not execute") or corrupt the dictionary's internal buckets. The CentralUI-009 +fix added a `_disposed` guard but did not address this data race — the guard +only prevents touching a *disposed* component, not concurrent access to a live +one. Under a busy debug stream this will intermittently fault the page. + +**Recommendation** + +Marshal the dictionary mutation onto the render thread too — move the +`UpsertWithCap` call inside the `SafeInvokeAsync`/`InvokeAsync` body so all +access to `_attributeValues`/`_alarmStates` happens on the renderer's +dispatcher. Alternatively guard both the writes and the `Filtered*` reads with a +lock, or use a concurrent collection. The cap-trim loop must be inside the same +critical section as the upsert. + +**Resolution** + +_Unresolved._ + +### CentralUI-022 — `Deployments` push handler fires `InvokeAsync` with no disposal guard + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Components/Pages/Deployment/Deployments.razor:221-229,317-322` | + +**Description** + +`OnDeploymentStatusChanged` is invoked by `IDeploymentStatusNotifier`, a process +singleton, on the DeploymentManager service thread. The handler does +`_ = InvokeAsync(async () => { await LoadDataAsync(); StateHasChanged(); })`, +discarding the returned task. `Dispose()` unsubscribes the handler, but there is +a race window: the notifier can read the subscriber list and begin invoking +`OnDeploymentStatusChanged` *just before* the component is disposed, so +`InvokeAsync` then runs against a disposed component and throws +`ObjectDisposedException` on the DeploymentManager thread — an unobserved task +exception (the task is fire-and-forget). The same hazard was explicitly fixed +for `DebugView` (CentralUI-009, `SafeInvokeAsync` + `_disposed` flag) and +`ToastNotification` (CentralUI-010), but the new push-based `Deployments` +handler introduced by the CentralUI-006 fix did not adopt the same guard. +Separately, every push event triggers two full repository reloads +(`GetAllInstancesAsync` + `GetAllDeploymentRecordsAsync`) for every open +circuit, so a burst of status changes amplifies into N×2 round-trips per tick. + +**Recommendation** + +Add a `volatile bool _disposed` set first in `Dispose()`, have +`OnDeploymentStatusChanged` no-op when set, and wrap the `InvokeAsync` dispatch +to swallow `ObjectDisposedException` (mirror `DebugView.SafeInvokeAsync`). +Optionally coalesce bursts (debounce) and/or reload only the changed record +rather than the whole table on each event. + +**Resolution** + +_Unresolved._ + +### CentralUI-023 — Residual bare `catch {}` blocks swallow JS interop errors + +| | | +|--|--| +| Severity | Low | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Components/Pages/Monitoring/ParkedMessages.razor:690-698`; `src/ScadaLink.CentralUI/Components/Shared/DiffDialog.razor:107-116,118-130,104` | + +**Description** + +CentralUI-018 narrowed the bare `catch {}` blocks in `MonacoEditor`, +`TreeView`, and `Sites.razor`, but the same pattern survives elsewhere. +`ParkedMessages.CopyAsync` wraps `navigator.clipboard.writeText` in +`catch { _toast.ShowError("Copy failed."); }` — a real `JSException` +(clipboard permission denied) and an expected `JSDisconnectedException` are +treated identically and neither is logged. `DiffDialog.TryLockBodyAsync` / +`TryUnlockBodyAsync` each have a bare outer `catch` whose handler does another +JS call wrapped in a second bare `catch { /* swallow */ }`, and +`OnAfterRenderAsync`'s `_modalRef.FocusAsync()` is wrapped in a bare +`catch { /* prerender or detached: ignore */ }`. Genuine interop failures in +these paths are invisible in production logs. + +**Recommendation** + +Catch `JSDisconnectedException` silently and `JSException` (and +`InvalidOperationException` for the prerender focus case) with an `ILogger` +call, consistent with the CentralUI-018 fixes in the same module. + +**Resolution** + +_Unresolved._ + +### CentralUI-024 — Claim lookups use magic strings instead of `JwtTokenService` constants + +| | | +|--|--| +| Severity | Low | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Components/Layout/NavMenu.razor:102`; `src/ScadaLink.CentralUI/Components/Pages/Dashboard.razor:14`; `GetCurrentUserAsync` in `Templates.razor`, `TemplateEdit.razor`, `TemplateCreate.razor`, `SharedScripts.razor`, `SharedScriptForm.razor`, `Sites.razor`, `Topology.razor`, `InstanceCreate.razor`, `InstanceConfigure.razor` | + +**Description** + +`ScadaLink.Security.JwtTokenService` exposes the canonical claim-type constants +(`UsernameClaimType = "Username"`, `DisplayNameClaimType = "DisplayName"`, +`RoleClaimType`, `SiteIdClaimType`). `SiteScopeService` correctly uses +`JwtTokenService.SiteIdClaimType`, but every `GetCurrentUserAsync` helper across +ten pages does `authState.User.FindFirst("Username")?.Value`, and `NavMenu` / +`Dashboard` do `context.User.FindFirst("DisplayName")`. The literals happen to +match the constants today, so there is no live bug — but if the claim type is +ever renamed in `JwtTokenService` (the single source of truth) every one of +these call sites silently breaks, falling back to `"unknown"` for the audit +user and a blank display name. The duplicated `GetCurrentUserAsync` helper is +also copy-pasted verbatim into ten components. + +**Recommendation** + +Replace the string literals with `JwtTokenService.UsernameClaimType` / +`DisplayNameClaimType`. Consider extracting the repeated `GetCurrentUserAsync` +into a single shared helper (e.g. an extension on `AuthenticationStateProvider` +or a small scoped service) so the claim lookup lives in exactly one place. + +**Resolution** + +_Unresolved._ + +### CentralUI-025 — `SessionExpiry` polling/redirect path has no test coverage + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.CentralUI.Tests/Auth/SessionExpiryPolicyTests.cs`; `src/ScadaLink.CentralUI/Components/Shared/SessionExpiry.razor` | + +**Description** + +`SessionExpiryPolicyTests` covers only `AuthEndpoints.BuildSignInProperties()` +(the sign-in properties shape). The actual runtime behaviour of +`SessionExpiry.razor` — that an expired session triggers a redirect to +`/login`, that an authenticated session does not, and that the component does +not poll/redirect on the `/login` page itself — is untested. Had a behavioural +test exercised the redirect with an expired/anonymous auth state against the +real `CookieAuthenticationStateProvider`, the CentralUI-020 defect (the frozen +snapshot never reporting an expired session) would have been caught. The +component is the system's only client-side idle-logout mechanism, so the gap is +material. + +**Recommendation** + +Add bUnit tests for `SessionExpiry`: (a) with an unauthenticated auth state the +component navigates to `/login`; (b) with an authenticated state it does not; +(c) on the `/login` route it neither polls nor redirects. The provider used in +the test must be one whose state can actually transition to expired — which +also forces the CentralUI-020 fix. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/ClusterInfrastructure/findings.md b/code-reviews/ClusterInfrastructure/findings.md index 5007e9d..0b79077 100644 --- a/code-reviews/ClusterInfrastructure/findings.md +++ b/code-reviews/ClusterInfrastructure/findings.md @@ -5,10 +5,10 @@ | Module | `src/ScadaLink.ClusterInfrastructure` | | Design doc | `docs/requirements/Component-ClusterInfrastructure.md` | | Status | Reviewed | -| Last reviewed | 2026-05-16 | +| Last reviewed | 2026-05-17 | | Reviewer | claude-agent | -| Commit reviewed | `9c60592` | -| Open findings | 0 | +| Commit reviewed | `39d737e` | +| Open findings | 2 | ## Summary @@ -29,20 +29,39 @@ every other component runs on, yet it presently delivers nothing the design requ The single options class is clean and its test covers defaults and setters adequately for what exists. +**Re-review 2026-05-17 (commit `39d737e`).** All eight prior findings (CI-001..008) +were resolved by the batch of work between `9c60592` and `39d737e`: `ClusterOptions` +gained XML docs, the `SectionName` constant, and the `DownIfAlone` property; +`ClusterOptionsValidator` was added; `ServiceCollectionExtensions` now registers the +validator and throws from the dead actor-registration method; and the test project +grew to 16 cases across three test classes. The module is in good shape — the +`ClusterOptions` contract, its validator, and the DI registration are all sound, +well-documented, and well-tested. This re-review examined all three source files and +all three test files against the full 10-category checklist and found **two new +issues**, both stemming from work the prior review explicitly deferred to a "Host +review" that has not happened: the `DownIfAlone` property is exposed and validated as +part of the configuration contract but is never consumed — `ScadaLink.Host`'s +`BuildHocon` still hard-codes `down-if-alone = on` (CI-009, Medium) — and the validator +does not enforce the design doc's requirement that `down-if-alone` be `on` for the +keep-oldest resolver, so `DownIfAlone = false` is silently accepted (CI-010, Low). + ## Checklist coverage +Original review (2026-05-16, `9c60592`) below; the re-review notes (2026-05-17, +`39d737e`) are appended in each row. + | # | Category | Examined | Notes | |---|----------|----------|-------| -| 1 | Correctness & logic bugs | ✓ | No executable logic exists beyond an options POCO; no logic bugs, but `ServiceCollectionExtensions` returns success while doing nothing (CI-002). | -| 2 | Akka.NET conventions | ✓ | No actors, no `ActorSystem` bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against `Tell`/`Ask`, immutability, or `PipeTo`. | -| 3 | Concurrency & thread safety | ✓ | No shared mutable state, no actors, no async code. No issues found in current code. | -| 4 | Error handling & resilience | ✓ | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. | -| 5 | Security | ✓ | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. | -| 6 | Performance & resource management | ✓ | No streams, connections, timers, or `IDisposable` resources exist yet. No issues found in current code. | -| 7 | Design-document adherence | ✓ | Severe drift: the module implements none of its documented responsibilities (CI-001). `ClusterOptions` also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and `down-if-alone` (CI-003). | -| 8 | Code organization & conventions | ✓ | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing `IValidateOptions`/data-annotation validation (CI-004) versus the Options pattern intent. | -| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). | -| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). | +| 1 | Correctness & logic bugs | ✓ | No executable logic exists beyond an options POCO; no logic bugs, but `ServiceCollectionExtensions` returns success while doing nothing (CI-002). **Re-review:** CI-002 resolved. New — `DownIfAlone` is a settable property that controls nothing because the HOCON builder hard-codes the value (CI-009). | +| 2 | Akka.NET conventions | ✓ | No actors, no `ActorSystem` bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against `Tell`/`Ask`, immutability, or `PipeTo`. **Re-review:** confirmed the Akka bootstrap legitimately lives in `ScadaLink.Host` (CI-001 resolution); still nothing actor-related in this module. No issues. | +| 3 | Concurrency & thread safety | ✓ | No shared mutable state, no actors, no async code. No issues found in current code. **Re-review:** validator and DI extensions are stateless; no issues. | +| 4 | Error handling & resilience | ✓ | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. **Re-review:** the validator now fails fast on misconfiguration. New — it does not enforce the design doc's `down-if-alone = on` requirement (CI-010). | +| 5 | Security | ✓ | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. **Re-review:** still no authn/authz surface, no secret handling. No issues. | +| 6 | Performance & resource management | ✓ | No streams, connections, timers, or `IDisposable` resources exist yet. No issues found in current code. **Re-review:** no resources held; the validator allocates a small failure list per call only. No issues. | +| 7 | Design-document adherence | ✓ | Severe drift: the module implements none of its documented responsibilities (CI-001). `ClusterOptions` also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and `down-if-alone` (CI-003). **Re-review:** CI-001/CI-003 resolved (ownership split documented; `DownIfAlone` added). New — `DownIfAlone` was added to the contract but never wired into the HOCON (CI-009). | +| 8 | Code organization & conventions | ✓ | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing `IValidateOptions`/data-annotation validation (CI-004) versus the Options pattern intent. **Re-review:** CI-004/CI-005 resolved; `SectionName` constant present and options/validator placement correct. No issues. | +| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). **Re-review:** CI-006 resolved — 16 tests across three classes covering options, validator, and DI registration. No `DownIfAlone`-wiring test exists, but that wiring lives in the Host (CI-009). No new issue here. | +| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). **Re-review:** CI-007/CI-008 resolved — full XML docs on all members; skeleton comments gone. Note: the `DownIfAlone` XML doc calls `true` "the design-doc requirement" yet the value is inert (CI-009) and unenforced (CI-010). | ## Findings @@ -490,3 +509,97 @@ section, and the component table reflects the true placement. This is a documentation-only finding, so no runtime regression test is meaningful; verified by inspection of `ServiceCollectionExtensions.cs` and `docs/requirements/Component-ClusterInfrastructure.md:21-39`. + +### ClusterInfrastructure-009 — `DownIfAlone` is an inert configuration knob — never consumed by the HOCON builder + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:74` | + +**Description** + +The `DownIfAlone` property was added to `ClusterOptions` by CI-003's resolution as +part of "the split-brain configuration contract". It is public, defaults to `true`, +carries an XML doc presenting it as "the design-doc requirement", and is exercised by +`ClusterOptionsTests.DownIfAlone_CanBeSet`. However, nothing in the system reads it. +The Akka.NET HOCON is generated by `ScadaLink.Host.Actors.AkkaHostedService.BuildHocon`, +which **hard-codes** the resolver setting: + +``` +split-brain-resolver { + active-strategy = ... + stable-after = ... + keep-oldest { + down-if-alone = on + } +} +``` + +`BuildHocon` receives the full `ClusterOptions` instance and consumes every other +field (`SeedNodes`, `MinNrOfMembers`, `SplitBrainResolverStrategy`, `StableAfter`, +`HeartbeatInterval`, `FailureDetectionThreshold`) but ignores `DownIfAlone` entirely. +The result is a configuration property that an operator can set in `appsettings.json`, +that passes validation, and that has **zero runtime effect** — setting +`DownIfAlone: false` does not turn the flag off. CI-003's resolution explicitly +acknowledged this gap ("wiring it to read `DownIfAlone` is a one-line `ScadaLink.Host` +change ... noted for the Host's review") but the wiring was never done and no tracked +finding carried it, so the gap has silently persisted to commit `39d737e`. An inert, +misleadingly-documented configuration knob is a correctness and design-adherence +defect: it gives operators a false sense of control over a safety-critical resolver +setting. + +**Recommendation** + +Either (a) wire `DownIfAlone` into `BuildHocon` — emit `down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}` +— so the property does what its XML doc claims (a Host-side change, to be tracked in +the Host module's review since `BuildHocon` lives there), or (b) if the flag is +intentionally fixed at `on` and must never be operator-configurable, remove the +`DownIfAlone` property from `ClusterOptions` and document the hard-coded `on` value as +a non-negotiable invariant. Do not leave a public, settable, validated property that +controls nothing. + +**Resolution** + +_Unresolved._ + +### ClusterInfrastructure-010 — Validator does not enforce `DownIfAlone = true` despite the design doc requiring it + +| | | +|--|--| +| Severity | Low | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptionsValidator.cs:21-71` | + +**Description** + +`Component-ClusterInfrastructure.md` (Split-Brain Resolution) states the keep-oldest +resolver must be configured with `down-if-alone = on`, and the XML doc on +`ClusterOptions.DownIfAlone` calls `true` "the design-doc requirement" — the rationale +being that without it the oldest node can run as an isolated single-node cluster +during a partition while the younger node forms its own. `ClusterOptionsValidator` +guards every other safety-critical setting (`MinNrOfMembers == 1`, `keep-oldest`-only +strategy, positive timings, heartbeat below the failure threshold) but performs no +check on `DownIfAlone`. A configuration of `DownIfAlone: false` therefore passes +validation cleanly. This is currently latent because CI-009 shows the property is not +consumed at all, but the moment CI-009 is fixed by wiring the property into the HOCON +(option (a)), `DownIfAlone: false` would silently produce the unsafe single-node +behaviour the design doc explicitly forbids — with no fail-fast guard. The validator +is the right place to enforce the invariant, consistent with how it already rejects +quorum split-brain strategies. + +**Recommendation** + +If CI-009 is resolved by keeping `DownIfAlone` configurable, add a check to +`ClusterOptionsValidator.Validate` that fails when `DownIfAlone` is `false` (or, if +some future deployment legitimately needs it off, fails only in combination with the +`keep-oldest` strategy), with a message explaining the isolated-single-node-cluster +hazard. If CI-009 is resolved by removing the property, this finding is moot and +should be closed as resolved alongside it. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/Commons/findings.md b/code-reviews/Commons/findings.md index 4e02c8f..0c4a53a 100644 --- a/code-reviews/Commons/findings.md +++ b/code-reviews/Commons/findings.md @@ -5,10 +5,10 @@ | Module | `src/ScadaLink.Commons` | | Design doc | `docs/requirements/Component-Commons.md` | | Status | Reviewed | -| Last reviewed | 2026-05-16 | +| Last reviewed | 2026-05-17 | | Reviewer | claude-agent | -| Commit reviewed | `9c60592` | -| Open findings | 0 | +| Commit reviewed | `39d737e` | +| Open findings | 2 | ## Summary @@ -32,14 +32,28 @@ kind of edge-case logic that warrants them. Entity and message contracts otherwi clean and additive-evolution-friendly, with the exception of one `ValueTuple` use in a wire command. +**Re-review 2026-05-17 (commit `39d737e`).** All twelve prior findings (Commons-001 +through Commons-012) are confirmed `Resolved` — the fixes are sound, well-targeted, and +backed by focused regression tests (`StaleTagMonitorRaceTests`, `DynamicJsonElementTests`, +`ScriptParametersTests`, `ManagementCommandRegistryTests`, `OpcUaEndpointConfigSerializerTests`, +`ResultTests`, `ValueFormatterTests`, `ConnectionBindingSerializationTests`, +`FlatteningAndScriptScopeTests`). The new files introduced since `9c60592` +(`TemplateAlarm` lock/inherit fields, `IExternalSystemRepository` name-keyed lookups, +`DeploymentStateQueryRequest`/`Response`, `ParameterDefinition`) follow the established +POCO / record / additive-evolution conventions and carry round-trip compatibility tests. +Two new Low-severity findings were recorded this pass: a `DynamicJsonElement` array +indexer that rejects `long` indices (Commons-013) and an `OpcUaEndpointConfigSerializer` +legacy-fallback path that can mislabel a corrupt new-shape row as `Legacy` (Commons-014). +No Critical, High, or Medium issues were found. + ## Checklist coverage | # | Category | Examined | Notes | |---|----------|----------|-------| -| 1 | Correctness & logic bugs | ✓ | `DynamicJsonElement.TryConvert` returns success for non-convertible types; `Result` allows null error; legacy-config fallback loses data. | +| 1 | Correctness & logic bugs | ✓ | `DynamicJsonElement.TryConvert` returns success for non-convertible types; `Result` allows null error; legacy-config fallback loses data. Re-review: `DynamicJsonElement.TryGetIndex` rejects non-`int` indices (Commons-013). | | 2 | Akka.NET conventions | ✓ | Commons has no actors (correct). Message contracts are records and immutable. One wire message uses `ValueTuple` (Commons-008). Correlation IDs present on request/response messages. | | 3 | Concurrency & thread safety | ✓ | `StaleTagMonitor` has a check-then-act race between the timer callback and `OnValueReceived` (Commons-001). | -| 4 | Error handling & resilience | ✓ | `ScriptParameters.GetNullable` silently swallows conversion failures (Commons-003); OPC UA legacy deserialize discards malformed input (Commons-005). | +| 4 | Error handling & resilience | ✓ | `ScriptParameters.GetNullable` silently swallows conversion failures (Commons-003); OPC UA legacy deserialize discards malformed input (Commons-005). Re-review: corrupt typed OPC UA rows can fall through to the legacy path and be mislabelled `Legacy` (Commons-014). | | 5 | Security | ✓ | No auth logic here. `SmtpConfiguration.Credentials` / OPC UA passwords are plain-string fields (storage/encryption is a consumer concern) — noted, not a finding. No script-trust violations: Commons defines no forbidden-API surface. | | 6 | Performance & resource management | ✓ | `StaleTagMonitor` disposes its `Timer` correctly. `DynamicJsonElement` references a `JsonElement` whose backing document lifetime is not owned (Commons-002). | | 7 | Design-document adherence | ✓ | Several behavior-bearing helper/validator/serializer classes push against REQ-COM-6 "no business logic" (Commons-007). Folder layout matches REQ-COM-5b. | @@ -566,3 +580,75 @@ the parameterless `ToString()`). The XML doc gained a remarks block stating the culture-invariant contract and why. Regression tests added in `ValueFormatterTests` (`FormatDisplayValue_Double_UsesInvariantCulture_*`, `_DateTime_*`, `_CollectionOfDoubles_*`, each pinned under `de-DE`). + +### Commons-013 — `DynamicJsonElement.TryGetIndex` rejects non-`int` index values + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.Commons/Types/DynamicJsonElement.cs:40-54` | + +**Description** + +`TryGetIndex` accepts an index only when `indexes[0] is int index`. `DynamicJsonElement` +is designed for dynamic access from scripts (`obj.items[0]`). In a `dynamic` expression the +index operand's runtime type follows the script's variable type — a script that computes +an index in a loop counter or reads it from another `DynamicJsonElement` (whose numbers +are unwrapped as `long` by `Wrap`, see `:105`) will pass a `long`, not an `int`. The +pattern match then fails, `TryGetIndex` returns `false`, and the dynamic binder throws a +`RuntimeBinderException` for what is a perfectly valid in-range index. Because the wrapper +itself surfaces JSON numbers as `long`, `obj.items[obj.count - 1]` — count being a wrapped +JSON number — is the exact failing case. The `int`-only guard also silently rejects +`byte`/`short` indices that would widen to a valid array position. + +**Recommendation** + +Accept any integral index by converting through `Convert.ToInt64` (guarded for +`OverflowException`) or by matching `int`, `long`, `short`, `byte` and normalizing to a +single integer before the bounds check. Add a regression test indexing with a `long`. + +**Resolution** + +_Unresolved._ + +### Commons-014 — `OpcUaEndpointConfigSerializer.Deserialize` can mislabel a corrupt typed row as `Legacy` + +| | | +|--|--| +| Severity | Low | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.Commons/Serialization/OpcUaEndpointConfigSerializer.cs:107-131` | + +**Description** + +`Deserialize` tries the typed path first: it parses the document, checks for an +`endpointUrl` property, then calls `JsonSerializer.Deserialize`. +The whole block is wrapped in `catch (JsonException) { /* fall through to legacy */ }`. +If a row *is* the current typed shape (it has `endpointUrl`) but is corrupt in a way that +makes `JsonSerializer.Deserialize` throw a `JsonException` — e.g. an enum-valued field +holding an unrecognised string, or a numeric field holding a non-numeric token — the +exception is swallowed and control falls through to `LoadLegacy`. `LoadLegacy` only +requires the root to be a JSON object, so it will usually succeed against the same input +and the result is reported as `OpcUaConfigParseStatus.Legacy`. The Commons-005 fix added +the `Malformed` status precisely so a caller can tell a recoverable legacy row from +unparseable data; this path re-introduces a softer version of the same confusion — a +genuinely broken current-shape row is presented to the user as a benign "please re-save" +legacy row, and the offending field is silently dropped by `FromFlatDict` (which ignores +keys it cannot parse) rather than surfaced. The XML doc describes the legacy fallback as +being for "pre-refactor rows" only and does not mention this branch. + +**Recommendation** + +Only fall through to `LoadLegacy` when the typed shape is genuinely *not present* — i.e. +the `endpointUrl` property is absent. When `endpointUrl` *is* present but typed +deserialization throws, classify the outcome as `Malformed` (or a distinct status) so the +caller can surface a real error instead of an empty/partial config. Tighten the XML doc +to describe this branch, and add a regression test for a typed row with an invalid enum +field. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/Communication/findings.md b/code-reviews/Communication/findings.md index a62220e..f3faf39 100644 --- a/code-reviews/Communication/findings.md +++ b/code-reviews/Communication/findings.md @@ -5,10 +5,10 @@ | Module | `src/ScadaLink.Communication` | | Design doc | `docs/requirements/Component-Communication.md` | | Status | Reviewed | -| Last reviewed | 2026-05-16 | +| Last reviewed | 2026-05-17 | | Reviewer | claude-agent | -| Commit reviewed | `9c60592` | -| Open findings | 0 | +| Commit reviewed | `39d737e` | +| Open findings | 4 | ## Summary @@ -25,20 +25,37 @@ CLAUDE.md "Resume for coordinator actors" decision. Design-doc adherence is othe good. Test coverage is broad for happy paths but has gaps around failover, cache mutation races, and the snapshot-timeout cleanup path. +#### Re-review 2026-05-17 (commit `39d737e`) + +All prior findings (Communication-001..011) are confirmed `Resolved` in this commit +and the fixes hold up against the source. The re-review walked all 10 checklist +categories again and uncovered a previously-missed defect at the centre of the gRPC +node-failover path: **`SiteStreamGrpcClientFactory.GetOrCreate` caches one client per +site identifier and silently ignores the `grpcEndpoint` argument on a cache hit**. The +`DebugStreamBridgeActor` reconnect logic flips `_useNodeA` and passes the *other* +node's endpoint, but the factory hands back the original NodeA-bound client every +time — so the documented "try the other site node endpoint" failover never actually +moves to NodeB (Communication-012). The same caching defect means a site address +change is never picked up because `RemoveSiteAsync` has no production caller +(Communication-013). Two Low findings round out the re-review: an untrusted +gRPC-supplied `correlation_id` flows straight into an Akka actor name +(Communication-014), and the factory's endpoint-reuse defect is masked by the test +mock (Communication-015). Four new findings, all Open: one High, one Medium, two Low. + ## Checklist coverage | # | Category | Examined | Notes | |---|----------|----------|-------| -| 1 | Correctness & logic bugs | ✓ | Snapshot-timeout orphan, reconnect not calling `CleanupGrpc`, subscription-map races. | -| 2 | Akka.NET conventions | ✓ | No supervision strategy on coordinators; `Sender` captured in async-launched closure path. | -| 3 | Concurrency & thread safety | ✓ | `SiteStreamGrpcClient._subscriptions` overwrite/remove race; `_siteClients` field reassignment unused but non-readonly. | -| 4 | Error handling & resilience | ✓ | gRPC reconnect leaks server-side relay; `LoadSiteAddressesFromDb` swallows DB failures silently. | -| 5 | Security | ✓ | No findings in module code. DebugStreamHub auth lives outside this module (Central UI). | -| 6 | Performance & resource management | ✓ | Orphaned subscriptions/CTS leaks; `SiteStreamGrpcClientFactory.Dispose` blocks on async. | -| 7 | Design-document adherence | ✓ | `GrpcMaxStreamLifetime` / keepalive options defined but never applied; hard-coded values used instead. | -| 8 | Code organization & conventions | ✓ | Options pattern correct; minor: public records declared in actor files. No structural issues. | -| 9 | Testing coverage | ✓ | No tests for snapshot-timeout cleanup, address-cache refresh races, or gRPC server reconnect-leak. | -| 10 | Documentation & comments | ✓ | XML comment on `DebugStreamBridgeActor` says "Persistent actor" — it is not an Akka.Persistence actor. | +| 1 | Correctness & logic bugs | ✓ | Re-review: factory ignores endpoint on cache hit, defeating NodeA→NodeB stream failover (Communication-012). Prior items resolved. | +| 2 | Akka.NET conventions | ✓ | Coordinator `Resume` strategies now present and verified. No new issues. | +| 3 | Concurrency & thread safety | ✓ | Subscription-map register/remove now ownership-checked. `_siteClients` readonly. No new issues. | +| 4 | Error handling & resilience | ✓ | `Status.Failure` handler added; reconnect unsubscribes prior stream. No new issues. | +| 5 | Security | ✓ | Re-review: public gRPC `correlation_id` flows unvalidated into an Akka actor name (Communication-014). | +| 6 | Performance & resource management | ✓ | Synchronous `Dispose` paths fixed; CTS leaks resolved. No new issues. | +| 7 | Design-document adherence | ✓ | Re-review: site gRPC address-change disposal not wired — `RemoveSiteAsync` is dead code (Communication-013). gRPC options now applied. | +| 8 | Code organization & conventions | ✓ | Options pattern correct; public records still declared in actor files (acceptable). No structural issues. | +| 9 | Testing coverage | ✓ | Re-review: prior gaps closed, but the factory mock masks the endpoint-reuse defect — no real node-flip coverage (Communication-015). | +| 10 | Documentation & comments | ✓ | `DebugStreamBridgeActor` summary corrected. No new issues. | ## Findings @@ -519,3 +536,151 @@ passes after): (added with this finding's resolution). The full module suite (`dotnet test tests/ScadaLink.Communication.Tests`) is green at 111 passing tests. + +### Communication-012 — gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover + +| | | +|--|--| +| Severity | High | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:39`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:166` | + +**Description** + +`SiteStreamGrpcClientFactory.GetOrCreate` is `_clients.GetOrAdd(siteIdentifier, _ => +CreateClient(grpcEndpoint))` — it keys the cache by **site identifier only** and the +`grpcEndpoint` argument is used *exclusively* for the first-ever creation. Every +subsequent call for that site returns the originally-cached `SiteStreamGrpcClient`, +which is permanently bound to the `GrpcChannel` of whatever endpoint was passed first. + +`DebugStreamBridgeActor` relies on the opposite behaviour. On a gRPC stream error, +`HandleGrpcError` flips `_useNodeA` and `OpenGrpcStream` recomputes +`endpoint = _useNodeA ? _grpcNodeAAddress : _grpcNodeBAddress`, then calls +`_grpcFactory.GetOrCreate(_siteIdentifier, endpoint)` expecting a client connected to +the *other* node. Because the factory ignores the new endpoint, the bridge actor +reconnects to the **same failed NodeA endpoint** on every retry. The design doc's +core debug-stream failover behaviour ("tries the other site node endpoint", "NodeB if +NodeA failed, or vice versa") is therefore inoperative — when a site node goes down, +the debug stream cannot move to the surviving node and simply exhausts `MaxRetries` +against the dead endpoint and terminates. The `_useNodeA` flip, the `previousEndpoint` +computation in `HandleGrpcError`, and the `CleanupGrpc` endpoint selection are all +dead logic. (Communication-002's `Unsubscribe`-before-reconnect fix still functions, +but it unsubscribes and re-subscribes on the *same* client/node rather than the +intended other node.) + +**Recommendation** + +Make the per-site client aware of both endpoints, or key the cache by +`(siteIdentifier, endpoint)`, or have `GetOrCreate` detect an endpoint change and +dispose+recreate the cached client. Given the design intent ("Falls back to NodeB if +NodeA connection fails"), the cleanest fix is to give `SiteStreamGrpcClient` (or a +per-site holder) both NodeA/NodeB addresses and let it switch channels internally, +removing the endpoint argument from `GetOrCreate` entirely. Add a test that drives a +real `SiteStreamGrpcClientFactory` through a node flip and asserts the second client +targets the other endpoint. + +**Resolution** + +_Unresolved._ + +### Communication-013 — Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:58` | + +**Description** + +The design doc states that `SiteStreamGrpcClientFactory` "Disposes clients on site +removal or address change." `RemoveSiteAsync` implements the disposal mechanism, but +a repo-wide search finds **no production caller** — only tests invoke it. Combined +with the cache-by-site-identifier behaviour (Communication-012), the consequence is +that once a site's `SiteStreamGrpcClient` is created, a later edit to that site's +`GrpcNodeAAddress` / `GrpcNodeBAddress` (via the Central UI or CLI) is never reflected +in the cached client — it keeps using the stale channel for the life of the process. +`CentralCommunicationActor` already refreshes the *Akka* address cache every 60s and +recreates ClusterClients on change, but there is no equivalent invalidation path +wired into the gRPC client factory. A site whose gRPC endpoints are corrected after +an initial misconfiguration will never have working debug streaming until the central +node is restarted. + +**Recommendation** + +Wire a site-removal / address-change signal into `SiteStreamGrpcClientFactory` — +e.g. have `CentralCommunicationActor` (which already detects address changes in +`HandleSiteAddressCacheLoaded`) call `RemoveSiteAsync` for sites whose gRPC addresses +changed or were removed, or fold the gRPC endpoints into the same refresh cycle. If +the on-the-fly address-change requirement is intentionally dropped, remove +`RemoveSiteAsync` and correct the design doc. + +**Resolution** + +_Unresolved._ + +### Communication-014 — Untrusted gRPC `correlation_id` flows directly into an Akka actor name + +| | | +|--|--| +| Severity | Low | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:124` | + +**Description** + +`SubscribeInstance` is a public gRPC endpoint hosted on each site node. It creates the +relay actor with `$"stream-relay-{request.CorrelationId}-{actorSeq}"` as the actor +name, where `request.CorrelationId` comes straight off the wire. Akka actor names have +a restricted character set; a `correlation_id` containing `/`, whitespace, or other +disallowed characters makes `ActorSystem.ActorOf` throw `InvalidActorNameException`. +That exception is not caught inside `SubscribeInstance`, so it escapes as an unhandled +RPC fault (and after the `_streamSubscriber.Subscribe` / `_activeStreams` entry has +already been set up for the duration, though the `finally` does not run because the +throw is before the `try`). In practice central always supplies a GUID, so impact is +low, but the server is trusting client-supplied input to be actor-name-safe. + +**Recommendation** + +Validate `request.CorrelationId` on entry (non-empty, matches an expected GUID/safe +pattern) and reject with `StatusCode.InvalidArgument` otherwise; or derive the actor +name solely from the internal `_actorCounter` and keep the correlation ID only as +actor state / dictionary key. + +**Resolution** + +_Unresolved._ + +### Communication-015 — No test exercises the real gRPC client factory across a node flip + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.Communication.Tests/Grpc/DebugStreamBridgeActorTests.cs:401`, `tests/ScadaLink.Communication.Tests/Grpc/SiteStreamGrpcClientFactoryTests.cs` | + +**Description** + +`DebugStreamBridgeActorTests` exercises the reconnect/failover paths through +`MockSiteStreamGrpcClientFactory`, which returns one fixed mock client regardless of +the `grpcEndpoint` argument. This is exactly the behaviour the *real* +`SiteStreamGrpcClientFactory` exhibits incorrectly (Communication-012), so the mock +masks the defect: `On_GrpcError_Reconnects_To_Other_Node` passes even though the real +factory never reaches the other node. `SiteStreamGrpcClientFactoryTests` only asserts +`GetOrCreate` returns the same client for the same site — it never checks what happens +when the same site is requested with a *different* endpoint. + +**Recommendation** + +Add a `SiteStreamGrpcClientFactoryTests` case that calls `GetOrCreate(site, endpointA)` +then `GetOrCreate(site, endpointB)` and asserts the second call targets `endpointB` +(it should fail today and pass after Communication-012 is fixed). Have the bridge-actor +test's mock factory track the endpoint per call so node-flip coverage is meaningful. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/README.md b/code-reviews/README.md index 493c0da..4e84393 100644 --- a/code-reviews/README.md +++ b/code-reviews/README.md @@ -40,20 +40,20 @@ module file and counted in **Total**. | Severity | Open findings | |----------|---------------| | Critical | 0 | -| High | 0 | -| Medium | 0 | -| Low | 0 | -| **Total** | **0** | +| High | 2 | +| Medium | 5 | +| Low | 10 | +| **Total** | **17** | ## Module Status | Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total | |--------|---------------|--------|----------------|------|-------| -| [CLI](CLI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 | -| [CentralUI](CentralUI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 19 | -| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 8 | -| [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 12 | -| [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 | +| [CLI](CLI/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/2 | 3 | 16 | +| [CentralUI](CentralUI/findings.md) | 2026-05-16 | `9c60592` | 0/1/2/3 | 6 | 25 | +| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/1 | 2 | 10 | +| [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/2 | 2 | 14 | +| [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/2 | 4 | 15 | | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 | | [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 | | [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 | @@ -80,14 +80,34 @@ description, location, recommendation — lives in the module's `findings.md`. _None open._ -### High (0) +### High (2) -_None open._ +| ID | Module | Title | +|----|--------|-------| +| CentralUI-020 | [CentralUI](CentralUI/findings.md) | Idle-session redirect never fires: `SessionExpiry` polls a frozen auth-state snapshot | +| Communication-012 | [Communication](Communication/findings.md) | gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover | -### Medium (0) +### Medium (5) -_None open._ +| ID | Module | Title | +|----|--------|-------| +| CLI-014 | [CLI](CLI/findings.md) | `update` commands require "core" fields, making partial updates impossible | +| CentralUI-021 | [CentralUI](CentralUI/findings.md) | `DebugView` stream callback mutates `Dictionary` off the render thread | +| CentralUI-022 | [CentralUI](CentralUI/findings.md) | `Deployments` push handler fires `InvokeAsync` with no disposal guard | +| ClusterInfrastructure-009 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | `DownIfAlone` is an inert configuration knob — never consumed by the HOCON builder | +| Communication-013 | [Communication](Communication/findings.md) | Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller | -### Low (0) +### Low (10) -_None open._ +| ID | Module | Title | +|----|--------|-------| +| CLI-015 | [CLI](CLI/findings.md) | `Component-CLI.md` command surface has drifted again in two places | +| CLI-016 | [CLI](CLI/findings.md) | `WriteAsTable` derives columns from the first array element only | +| CentralUI-023 | [CentralUI](CentralUI/findings.md) | Residual bare `catch {}` blocks swallow JS interop errors | +| CentralUI-024 | [CentralUI](CentralUI/findings.md) | Claim lookups use magic strings instead of `JwtTokenService` constants | +| CentralUI-025 | [CentralUI](CentralUI/findings.md) | `SessionExpiry` polling/redirect path has no test coverage | +| ClusterInfrastructure-010 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Validator does not enforce `DownIfAlone = true` despite the design doc requiring it | +| Commons-013 | [Commons](Commons/findings.md) | `DynamicJsonElement.TryGetIndex` rejects non-`int` index values | +| Commons-014 | [Commons](Commons/findings.md) | `OpcUaEndpointConfigSerializer.Deserialize` can mislabel a corrupt typed row as `Legacy` | +| Communication-014 | [Communication](Communication/findings.md) | Untrusted gRPC `correlation_id` flows directly into an Akka actor name | +| Communication-015 | [Communication](Communication/findings.md) | No test exercises the real gRPC client factory across a node flip |