diff --git a/.gitignore b/.gitignore index 28ff025..f493940 100644 --- a/.gitignore +++ b/.gitignore @@ -32,3 +32,7 @@ TestResults/ **/logs/ site_events.db data/ + +# Claude Code local files +.claude/settings.local.json +.claude/scheduled_tasks.lock diff --git a/code-reviews/CLI/findings.md b/code-reviews/CLI/findings.md new file mode 100644 index 0000000..ecceece --- /dev/null +++ b/code-reviews/CLI/findings.md @@ -0,0 +1,442 @@ +# Code Review — CLI + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.CLI` | +| Design doc | `docs/requirements/Component-CLI.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 13 | + +## Summary + +The CLI is a small, well-structured HTTP client over the Management API. The command-tree +construction is consistent and repetitive in a good way: every subcommand funnels through +`CommandHelpers.ExecuteCommandAsync`, which centralizes URL/credential resolution, HTTP +dispatch, and response handling. There are no Akka.NET concerns (the CLI is a pure HTTP +client) and no concurrency-sensitive code apart from the `debug stream` SignalR handler. + +The dominant theme is **graceful-degradation gaps**: several user-supplied inputs (malformed +URLs, malformed `--bindings`/`--overrides` JSON, non-JSON success bodies) are deserialized +or constructed without `try/catch`, so a normal user mistake surfaces as an unhandled +exception with a stack trace instead of a clean error message and exit code 1. A second +theme is **dead configuration**: the `SCADALINK_FORMAT` environment variable and the +`defaultFormat` config-file field are loaded by `CliConfig` but never consulted by any +command, so the documented format-precedence chain does not work. The third theme is +**substantial design-document drift**: `Component-CLI.md` describes a name-keyed, +`--file`-based command surface that bears little resemblance to the implemented +ID-keyed, flag-based surface. Test coverage exercises `OutputFormatter`, `CliConfig`, and +`CommandHelpers.HandleResponse`, but the HTTP client, the `debug stream` path, the JSON +argument parsing, and the command-tree wiring are untested. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | Format precedence is broken (CLI-001); empty/non-JSON success bodies crash table rendering (CLI-002, CLI-003). | +| 2 | Akka.NET conventions | ☑ | Not applicable — CLI is a pure HTTP/SignalR client with no Akka.NET runtime (design doc confirms). No issues. | +| 3 | Concurrency & thread safety | ☑ | Only `debug stream` is concurrent; `CancellationTokenSource` is never disposed (CLI-011). Exit-code resolution after Ctrl+C is loose (CLI-012). | +| 4 | Error handling & resilience | ☑ | Unhandled exceptions on malformed URL (CLI-004) and malformed JSON arguments (CLI-005); `StartAsync` cancellation is misreported (CLI-010). | +| 5 | Security | ☑ | `--password` on the command line leaks into process listings / shell history with no env-var or prompt alternative (CLI-006). | +| 6 | Performance & resource management | ☑ | `HttpClient` per invocation is acceptable for a one-shot CLI. `CancellationTokenSource` leak noted in CLI-011. | +| 7 | Design-document adherence | ☑ | `Component-CLI.md` is heavily stale relative to the implemented command surface (CLI-007). | +| 8 | Code organization & conventions | ☑ | Consistent and clean; `CliConfig.DefaultFormat` is loaded but unused (covered by CLI-001). Minor: `--format` not validated (CLI-008). | +| 9 | Testing coverage | ☑ | No tests for `ManagementHttpClient`, `DebugCommands`, command-tree wiring, or JSON argument parsing (CLI-013). | +| 10 | Documentation & comments | ☑ | `Component-CLI.md` mismatch (CLI-007); the in-repo `README.md` is reasonably accurate. Minor exit-code doc mismatch (CLI-009). | + +## Findings + +### CLI-001 — `SCADALINK_FORMAT` env var and config-file format are dead; format precedence broken + +| | | +|--|--| +| Severity | High | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.CLI/Commands/CommandHelpers.cs:18`, `src/ScadaLink.CLI/Commands/DebugCommands.cs:45`, `src/ScadaLink.CLI/CliConfig.cs:37-39` | + +**Description** + +`CliConfig.Load()` reads `SCADALINK_FORMAT` and the `defaultFormat` config-file field into +`CliConfig.DefaultFormat`, and `Component-CLI.md` documents a format-precedence chain +(command-line option → env var → config file). However, every command resolves the format +with `var format = result.GetValue(formatOption) ?? "json";` and `formatOption` is created +in `Program.cs:11` with `DefaultValueFactory = _ => "json"`. `GetValue` therefore always +returns a non-null value ("json" when the flag is absent), so the `?? "json"` fallback never +fires and `config.DefaultFormat` is never consulted. The env var and config-file format +settings are dead code: `scadalink site list` always outputs JSON regardless of +`SCADALINK_FORMAT=table` or a `defaultFormat` entry in `~/.scadalink/config.json`. The +documented behaviour silently does not work. + +**Recommendation** + +Either remove the `--format` option's `DefaultValueFactory` and have `CommandHelpers` +resolve precedence explicitly (`result.GetValue(formatOption)` → `config.DefaultFormat`), +or detect whether the option was explicitly supplied (`result.GetResult(formatOption)`) and +only then override the config value. Apply the same fix to `DebugCommands.BuildStream`. + +**Resolution** + +_Unresolved._ + +### CLI-002 — Empty success body crashes table rendering with an unhandled exception + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.CLI/Commands/CommandHelpers.cs:59-68`, `src/ScadaLink.CLI/Commands/CommandHelpers.cs:78-80` | + +**Description** + +`ManagementHttpClient.SendCommandAsync` returns `JsonData = responseBody` for any +success status code, including a 200/204 with an empty body. `HandleResponse` then tests +`response.JsonData != null` — an empty string is non-null — and for `--format table` +calls `WriteAsTable(response.JsonData)`, which immediately does `JsonDocument.Parse(json)`. +`JsonDocument.Parse("")` throws `JsonException`, which is not caught anywhere, so a +command that legitimately returns no body (e.g. a delete that returns 204) terminates with +a stack trace instead of a clean success message. + +**Recommendation** + +In `HandleResponse`, treat a null-or-whitespace `JsonData` as a "command succeeded, no +output" case (print nothing or `(ok)`), and return 0 before attempting to parse. + +**Resolution** + +_Unresolved._ + +### CLI-003 — Non-JSON success body crashes table rendering + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.CLI/Commands/CommandHelpers.cs:80` | + +**Description** + +`WriteAsTable` calls `JsonDocument.Parse(json)` with no `try/catch`. If the server returns +a success status but a body that is not valid JSON (a proxy/HTML error page returned with +a 200, a plain-text message, etc.), the CLI throws an unhandled `JsonException`. The +error-path code in `ManagementHttpClient` (lines 52-61) already defensively wraps +`JsonDocument.Parse` in a `try/catch`; the success path and `WriteAsTable` do not get the +same treatment. + +**Recommendation** + +Wrap the `JsonDocument.Parse` in `WriteAsTable` in a `try/catch`; on failure, fall back to +printing the raw body verbatim (as the JSON path already does at line 66). + +**Resolution** + +_Unresolved._ + +### CLI-004 — Malformed `--url` throws an unhandled `UriFormatException` + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.CLI/ManagementHttpClient.cs:13` | + +**Description** + +The `ManagementHttpClient` constructor does `new Uri(baseUrl.TrimEnd('/') + "/")` with no +validation. If the user passes a malformed URL (e.g. `--url localhost:9001` without a +scheme, or `--url ""`), `new Uri(...)` throws `UriFormatException`. This call is not +guarded by the `try/catch` in `SendCommandAsync` (it happens in the constructor at +`CommandHelpers.cs:50`), so a common typo terminates the CLI with a stack trace rather +than the documented "connection failure → exit 1 with a descriptive message". + +**Recommendation** + +Validate the URL before constructing the client — e.g. `Uri.TryCreate(url, UriKind.Absolute, out _)` in `CommandHelpers.ExecuteCommandAsync` and `DebugCommands.BuildStream` — and emit a +clean `INVALID_URL` error with exit code 1 on failure. + +**Resolution** + +_Unresolved._ + +### CLI-005 — Malformed `--bindings` / `--overrides` JSON throws unhandled exceptions + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.CLI/Commands/InstanceCommands.cs:55-58`, `src/ScadaLink.CLI/Commands/InstanceCommands.cs:181-182` | + +**Description** + +`set-bindings` deserializes the `--bindings` argument with +`JsonSerializer.Deserialize>>(...)` and then indexes `p[0]`/`p[1]` +and calls `p[0].GetString()!` / `p[1].GetInt32()`. `set-overrides` deserializes `--overrides` +with `JsonSerializer.Deserialize>(...)`. None of this is wrapped +in a `try/catch`. Invalid JSON throws `JsonException`; a pair with fewer than two elements +throws `ArgumentOutOfRangeException`; a non-string/non-int element throws `InvalidOperationException`. All of these surface as raw stack traces, so a user typo in a JSON argument +crashes the CLI instead of producing a clean validation error and exit code 1. + +**Recommendation** + +Wrap the parsing in `try/catch (JsonException ...)` (and guard the pair length / element +kinds), and on failure call `OutputFormatter.WriteError(...)` with an `INVALID_ARGUMENT` +code and return 1. + +**Resolution** + +_Unresolved._ + +### CLI-006 — Password is passed as a command-line argument with no safer alternative + +| | | +|--|--| +| Severity | Medium | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.CLI/Program.cs:9`, `src/ScadaLink.CLI/Commands/CommandHelpers.cs:36-44` | + +**Description** + +Credentials are supplied only via `--username` / `--password`. A password on the command +line is visible to any local user via the process list (`ps`, `/proc//cmdline`) and is +typically persisted into shell history. Unlike the management URL — which can also come +from `SCADALINK_MANAGEMENT_URL` or the config file — there is no environment-variable +fallback, no `--password-stdin`, and no interactive prompt for the password. For a tool +explicitly intended for CI/CD automation this materially increases the chance of credential +leakage. + +**Recommendation** + +Add a `SCADALINK_PASSWORD` environment variable fallback and/or a `--password-stdin` +option (read the password from stdin), and document that `--password` on the command line +is discouraged. Optionally prompt interactively when stdin is a TTY and no password was +supplied. + +**Resolution** + +_Unresolved._ + +### CLI-007 — `Component-CLI.md` command surface is substantially stale + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Open | +| Location | `docs/requirements/Component-CLI.md:51-211` (vs. all files under `src/ScadaLink.CLI/Commands/`) | + +**Description** + +The "Command Structure" section of the design doc no longer matches the implemented CLI. +Examples of the drift: + +- The doc keys most operations by **name** (`template get `, `instance get `, + `site get `); the implementation keys everything by integer **ID** via `--id` + (`TemplateCommands.cs:40`, `InstanceCommands.cs:31`, `SiteCommands.cs:26`). +- The doc shows `template create ... --file ` and `site update --file `; + the implementation has no `--file` option anywhere and instead takes individual flags + (`TemplateCommands.cs:52-72`, `SiteCommands.cs:83-115`). +- The doc lists commands that do not exist (`template diff`, `instance bind-connections`, + `instance assign-area`, `template attribute add --tag-path`, `data-connection assign/unassign`, + `security api-key enable/disable` as separate commands) and omits commands that do exist + (`instance alarm-override set/delete/list`, `external-system method` subgroup). +- The doc's `notification smtp update --file` differs from the implemented + `--server/--port/--auth-mode/--from-address` flags (`NotificationCommands.cs:72-94`). +- The doc uses `--site` for site identification in several places where the implementation + uses `--site-id` or `--identifier`. + +A reader following the design doc would be unable to drive the CLI. + +**Recommendation** + +Regenerate the "Command Structure" section of `Component-CLI.md` from the actual command +tree (the in-repo `src/ScadaLink.CLI/README.md` is much closer to reality and could be the +source), or mark the doc's command list as illustrative and point to the README as +authoritative. + +**Resolution** + +_Unresolved._ + +### CLI-008 — `--format` value is not validated + +| | | +|--|--| +| Severity | Low | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.CLI/Program.cs:10-11`, `src/ScadaLink.CLI/Commands/CommandHelpers.cs:60` | + +**Description** + +The `--format` option accepts any string. `HandleResponse` only checks +`string.Equals(format, "table", ...)`; any other value — including a typo like +`--format tabel` or `--format xml` — silently falls through to JSON output. The user gets +no feedback that their requested format was not honoured. + +**Recommendation** + +Restrict the option to the accepted values, e.g. `formatOption.AcceptOnlyFromAmong("json", "table")`, so `System.CommandLine` rejects invalid input with a clear parse error. + +**Resolution** + +_Unresolved._ + +### CLI-009 — Exit-code documentation does not match `HandleResponse` behaviour + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `docs/requirements/Component-CLI.md:238-249`, `src/ScadaLink.CLI/Commands/CommandHelpers.cs:75` | + +**Description** + +The design doc's Exit Codes table defines code 2 as "Authorization failure (insufficient +role)" and the Error Handling section says "If the server returns HTTP 403, the CLI exits +with code 2." `HandleResponse` implements `return response.StatusCode == 403 ? 2 : 1;`, +which is correct for the HTTP error path. However, the `NO_URL`, `NO_CREDENTIALS`, +`INVALID_OPERATION` (from `set-bindings`/`set-overrides`) and any other client-side failure +all return 1, and a connection failure carries `StatusCode == 0` — none of which the doc +enumerates. More importantly, an authorization failure that the server signals with a body +`code` of `UNAUTHORIZED` but an HTTP status other than 403 would be classified as a generic +error (exit 1). The mapping is purely status-driven and the doc does not state that. + +**Recommendation** + +Either document precisely that exit code 2 is determined solely by HTTP 403, or key the +"authorization failure" exit code off the response `code` field as well. Align the doc +with whichever is chosen. + +**Resolution** + +_Unresolved._ + +### CLI-010 — `debug stream` reports Ctrl+C during connect as a connection failure + +| | | +|--|--| +| Severity | Low | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.CLI/Commands/DebugCommands.cs:181-189` | + +**Description** + +`StreamDebugAsync` calls `await connection.StartAsync(cts.Token)` inside a +`try { } catch (Exception ex)` that unconditionally reports +`"Connection failed: {ex.Message}"` with code `CONNECTION_FAILED` and returns 1. If the +user presses Ctrl+C while the connection is still being established, `cts` is cancelled and +`StartAsync` throws `OperationCanceledException`; this is caught by the generic handler and +misreported as a connection failure (with exit code 1) rather than a clean user-initiated +cancellation (exit code 0). + +**Recommendation** + +Catch `OperationCanceledException` separately (return 0 quietly) before the generic +`catch (Exception)` handler, mirroring how the `exitTcs.Task.WaitAsync(cts.Token)` path at +lines 209-215 already treats cancellation as graceful. + +**Resolution** + +_Unresolved._ + +### CLI-011 — `CancellationTokenSource` in `debug stream` is never disposed + +| | | +|--|--| +| Severity | Low | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.CLI/Commands/DebugCommands.cs:89` | + +**Description** + +`var cts = new CancellationTokenSource();` is created in `StreamDebugAsync` but never +disposed; there is no `using` declaration and no explicit `Dispose()` call on any exit +path. `CancellationTokenSource` owns a `WaitHandle` and should be disposed. The impact is +small because the process exits shortly after, but it is an `IDisposable` left undisposed, +contrary to the review checklist's resource-management expectation. + +**Recommendation** + +Declare it as `using var cts = new CancellationTokenSource();` (or wrap the method body in +a `try/finally`). + +**Resolution** + +_Unresolved._ + +### CLI-012 — `debug stream` exit code is unreliable after stream termination + +| | | +|--|--| +| Severity | Low | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.CLI/Commands/DebugCommands.cs:208-227` | + +**Description** + +After `await exitTcs.Task.WaitAsync(cts.Token)`, the method returns +`exitTcs.Task.IsCompletedSuccessfully ? exitTcs.Task.Result : 0`. When the user cancels +with Ctrl+C, `WaitAsync` throws `OperationCanceledException` and `exitTcs` is typically +still incomplete, so the method returns 0 — correct. However, the `OnStreamTerminated` +handler and the `Closed` handler both call `exitTcs.TrySetResult`, and these run on +SignalR callback threads concurrently with the Ctrl+C path. If a stream termination and a +Ctrl+C race, the final exit code depends on which `TrySetResult` won and whether +`WaitAsync` observed completion before cancellation — the result is not deterministic. A +stream the server terminated abnormally can end up returning 0. + +**Recommendation** + +Resolve the exit code from a single authoritative source: after the `try/catch` around +`WaitAsync`, check `exitTcs.Task` completion explicitly and treat a Ctrl+C with no prior +result as 0, but always prefer a result that was set by `OnStreamTerminated`/`Closed`. +Consider awaiting `exitTcs.Task` without the cancellation token after a brief grace period. + +**Resolution** + +_Unresolved._ + +### CLI-013 — HTTP client, `debug stream`, and JSON-argument parsing are untested + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.CLI.Tests/` (vs. `src/ScadaLink.CLI/ManagementHttpClient.cs`, `src/ScadaLink.CLI/Commands/DebugCommands.cs`, `src/ScadaLink.CLI/Commands/InstanceCommands.cs:55-58`) | + +**Description** + +The test project covers `OutputFormatter`, `CliConfig.Load`, and +`CommandHelpers.HandleResponse`. It does not cover: + +- `ManagementHttpClient.SendCommandAsync` — the timeout (504), connection-failure (code 0), + and error-body-parsing paths are untested. +- The `debug stream` SignalR command — no tests at all. +- The JSON-argument parsing in `InstanceCommands` (`set-bindings`, `set-overrides`) — the + paths most likely to crash on bad input (CLI-005) have no coverage. +- Command-tree wiring — there is no test asserting that each `Build` produces the expected + subcommands/options or that the command-name derivation + (`ManagementCommandRegistry.GetCommandName`) resolves for every command type the CLI + constructs. + +**Recommendation** + +Add tests for `ManagementHttpClient` (using a stub `HttpMessageHandler`), for the +JSON-argument parsing helpers (extracting the parsing into testable methods), and a +smoke test that walks the root command tree and asserts every leaf command's payload type +resolves via `ManagementCommandRegistry`. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/CentralUI/findings.md b/code-reviews/CentralUI/findings.md new file mode 100644 index 0000000..9573755 --- /dev/null +++ b/code-reviews/CentralUI/findings.md @@ -0,0 +1,633 @@ +# Code Review — CentralUI + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.CentralUI` | +| Design doc | `docs/requirements/Component-CentralUI.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 19 | + +## Summary + +The Central UI is a sizeable, generally well-structured Blazor Server module: +custom Bootstrap components only (no third-party UI frameworks, as required), +consistent list/form page patterns, careful disposal in most components, and a +thoughtful Roslyn-backed script editor. The most serious problem is the +**Test Run sandbox** (`ScriptAnalysisService.RunInSandboxAsync`): it compiles +and executes arbitrary user C# *in the central process* with no enforcement of +the documented script trust model — the forbidden-API list is only a Monaco +editor diagnostic, never applied before execution — so a Design user can run +`System.IO`/`Process`/`Reflection` code on the central node. Several other +themes recur: (1) per-circuit security drift — site-scoped Deployment claims +are written at login but never read, so site scoping is not enforced anywhere; +(2) Blazor render-thread and disposal hazards — background `Timer` / `Task.Delay` +callbacks and stream callbacks touch component state and `@ref` children that +may already be disposed; (3) process-global mutation (`Console.SetOut`) shared +across concurrent circuits; (4) drift from the design doc on session expiry and +on the "deployment status pushes via SignalR" claim (the page actually polls). +Testing coverage is thin for a module this large: only the script analyzer, +TreeView, schema model, and a few data-connection pages have unit tests; most +pages and the auth bridge are untested. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | DebugView cap logic, audit-log timezone, toast race — see findings. | +| 2 | Akka.NET conventions | ☑ | Module is mostly UI; `DebugStreamService` actor usage reviewed (in Communication but driven from here). No actor-convention violations in CentralUI proper. | +| 3 | Concurrency & thread safety | ☑ | `Console.SetOut` global mutation, stream/timer callbacks on non-render threads, toast `_ = Task.Delay`. | +| 4 | Error handling & resilience | ☑ | Broad `catch {}` swallowing, dangling `TaskCompletionSource` on dialog disposal. | +| 5 | Security | ☑ | Sandbox not enforcing trust model (Critical); site scoping never enforced; auth bridge reads stale HttpContext; logout CSRF. | +| 6 | Performance & resource management | ☑ | N+1 site-connection query, repeated `FilteredMessages` recomputation, full-page paginators rendering all page buttons. | +| 7 | Design-document adherence | ☑ | Session expiry diverges from "15-min sliding + 30-min idle"; Deployments polls despite "push via SignalR"; nav exposes Deployment-only pages to all roles. | +| 8 | Code organization & conventions | ☑ | Generally good; options classes absent (no appsettings binding here); no major violations. | +| 9 | Testing coverage | ☑ | Auth, sandbox-run, DebugView, Health, ParkedMessages, most pages untested. | +| 10 | Documentation & comments | ☑ | Comments are accurate and helpful; a few stale claims noted. | + +## Findings + +### CentralUI-001 — Test Run sandbox executes arbitrary C# with no trust-model enforcement + +| | | +|--|--| +| Severity | Critical | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/ScriptAnalysis/ScriptAnalysisService.cs:171-424` | + +**Description** + +`RunInSandboxAsync` compiles user-supplied script code with `CSharpScript.Create` +and executes it (`script.RunAsync`) directly inside the central process. The +"sandbox" applies only a wall-clock timeout and an output-size cap. It does +**not** enforce the documented script trust model: the forbidden-API set +(`System.IO`, `System.Diagnostics`/`Process`, `System.Reflection`, `System.Net`, +threading) is checked only in `FindForbiddenApiUsages`, which feeds Monaco +editor diagnostics — it is never consulted before `RunInSandboxAsync` executes. +`DefaultOptions` references `typeof(object).Assembly` (the full BCL), so a +Design-role user can submit `System.IO.File.WriteAllText(...)`, +`System.Diagnostics.Process.Start(...)`, reflection, or raw socket code via +`POST /api/script-analysis/run` and it runs with the central host process's +full privileges. The endpoint is gated only by `RequireDesign`. This is a +remote code execution path on the central cluster node. + +**Recommendation** + +Before executing, run the same forbidden-API analysis used for diagnostics and +reject any script with a `SCADA001`/`SCADA002` (severity-8) marker; additionally +restrict the compilation's metadata references to the curated script API +surface, and ideally execute in an isolated `AssemblyLoadContext`/process with +constrained permissions. Treat the trust model as an execution-time gate, not +an editor hint. + +**Resolution** + +_Unresolved._ + +### CentralUI-002 — Site-scoped Deployment permissions are issued but never enforced + +| | | +|--|--| +| Severity | High | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Auth/AuthEndpoints.cs:63-69`; `src/ScadaLink.CentralUI/Components/Pages/Deployment/*.razor` | + +**Description** + +Login adds `SiteId` claims (`JwtTokenService.SiteIdClaimType`) for non-system-wide +Deployment users, and the design doc (Component-CentralUI "Responsibilities" and +CLAUDE.md Security & Auth) requires the Deployment role to be site-scoped. A +repo-wide search shows the `SiteId` claim is written at login and **never read +anywhere in CentralUI**. Deployment pages — `DebugView.razor`, `Deployments.razor`, +`InstanceCreate.razor`, `InstanceConfigure.razor`, `Topology.razor`, +`ParkedMessages.razor`, `EventLogs.razor` — list and act on every site with no +filtering by the user's permitted sites. A Deployment user scoped to one site +can deploy to, debug, and manage instances at any site. + +**Recommendation** + +Enforce site scoping: filter site/instance lists by the user's `SiteId` claims +(or treat the absence of `SiteId` claims as system-wide), and re-check the claim +server-side before any mutating cross-site command (deploy, enable/disable/delete, +debug stream, parked-message retry/discard). A shared helper that reads the +claims from `AuthenticationStateProvider` and exposes "permitted site ids" would +keep this consistent. + +**Resolution** + +_Unresolved._ + +### CentralUI-003 — `Console.SetOut`/`SetError` mutates process-global state across concurrent circuits + +| | | +|--|--| +| Severity | High | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/ScriptAnalysis/ScriptAnalysisService.cs:359-423` | + +**Description** + +`RunInSandboxAsync` redirects `Console.Out`/`Console.Error` to a per-call +`StringWriter`, runs the script, then restores them in `finally`. `Console.Out` +is process-global. If two users (two Blazor circuits) run Test Run concurrently, +their captured outputs interleave or cross over, and the `finally` of whichever +finishes first restores `Console.Out` to the *original* writer while the other +run is still executing — so the second run's script output is lost or written +to the real console. `RunInSandboxAsync` is `async` and the script runs on a +thread-pool thread, so concurrent execution is fully expected. + +**Recommendation** + +Do not redirect process-global `Console`. Provide console capture through the +script globals surface (e.g. a `TextWriter` exposed on `SandboxScriptHost` that +the sandbox API writes to), or serialize Test Run executions with a semaphore if +global redirection must be kept. Capturing per-call without global mutation is +the correct fix. + +**Resolution** + +_Unresolved._ + +### CentralUI-004 — `CookieAuthenticationStateProvider` reads `HttpContext` for the life of the circuit + +| | | +|--|--| +| Severity | High | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Auth/CookieAuthenticationStateProvider.cs:22-28` | + +**Description** + +`GetAuthenticationStateAsync` returns `_httpContextAccessor.HttpContext?.User`. +In Blazor Server, `HttpContext` is only valid during the initial HTTP request +that establishes the circuit; for the lifetime of the long-lived SignalR circuit +`IHttpContextAccessor.HttpContext` is `null` (or, worse, a stale/foreign context +if the accessor's `AsyncLocal` leaks). Any later call to +`GetAuthenticationStateAsync` — e.g. an `` re-evaluating, or pages +that call it directly (`Sites.razor`, `Templates.razor`) — then sees an +unauthenticated principal and may render the wrong UI, or returns a stale +identity that never reflects role changes. The class derives from +`ServerAuthenticationStateProvider`, which is designed to be seeded once via +`SetAuthenticationState`; overriding `GetAuthenticationStateAsync` to read +`HttpContext` defeats that design. + +**Recommendation** + +Capture the authenticated principal once when the circuit is created (e.g. via +the root component / `AuthenticationStateProvider` seeding pattern used by the +Blazor Web App template) and store it on the scoped provider, instead of reading +`IHttpContextAccessor` on every call. Do not depend on `HttpContext` after the +circuit is established. + +**Resolution** + +_Unresolved._ + +### CentralUI-005 — Session expiry implementation diverges from the documented policy + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Auth/AuthEndpoints.cs:47-81`; `src/ScadaLink.CentralUI/Components/Shared/SessionExpiry.razor:18-30` | + +**Description** + +CLAUDE.md (Security & Auth) specifies "15-minute expiry with sliding refresh, +30-minute idle timeout." `AuthEndpoints` instead sets a single fixed +`expires_at = UtcNow + 30 minutes` claim and a 30-minute cookie `ExpiresUtc`, +with no sliding refresh and no separate idle vs absolute timeout. +`SessionExpiry.razor` schedules a single hard redirect at that fixed time. The +result is a hard 30-minute cap with no sliding renewal — an active user is +logged out mid-session, and there is no 15-minute component at all. + +**Recommendation** + +Either implement the documented policy (sliding 15-minute token with refresh on +activity, plus a 30-minute idle cutoff) or update the design docs to match the +fixed 30-minute model. The code and the documented decision must agree. + +**Resolution** + +_Unresolved._ + +### CentralUI-006 — Deployment status page polls every 10s despite the documented SignalR-push design + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Components/Pages/Deployment/Deployments.razor:196-216` | + +**Description** + +Component-CentralUI "Real-Time Updates" states: "Deployment status: +Pending/in-progress/success/failed transitions push to the UI immediately via +SignalR (built into Blazor Server). No polling required for deployment +tracking." `Deployments.razor` instead runs a `Timer` that reloads all +deployment records and instance names from the database every 10 seconds. This +is a full N-record + instance-map reload per tick for every open circuit, and +contradicts the design. It also re-issues two repository round-trips on each +tick regardless of whether anything changed. + +**Recommendation** + +Implement push-based updates (an injected event/observable raised by the +Deployment Manager that the page subscribes to and renders via +`InvokeAsync(StateHasChanged)`), or amend the design doc to acknowledge polling. +If polling is kept as a fallback, fetch only changed/in-progress records. + +**Resolution** + +_Unresolved._ + +### CentralUI-007 — Monitoring nav links to Deployment-only pages are shown to all roles + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Components/Layout/NavMenu.razor:69-78`; `src/ScadaLink.CentralUI/Components/Pages/Monitoring/EventLogs.razor:2`; `src/ScadaLink.CentralUI/Components/Pages/Monitoring/ParkedMessages.razor:2` | + +**Description** + +`NavMenu` renders the "Event Logs" and "Parked Messages" links inside the +all-authenticated-users Monitoring section. The design doc classifies both the +Site Event Log Viewer and Parked Message Management as **Deployment Role**. +Two inconsistencies result: (a) an Admin- or Design-only user sees nav links +they cannot use; (b) the pages themselves are annotated only `[Authorize]` +(any authenticated user), not `[Authorize(Policy = RequireDeployment)]`, so a +non-Deployment user who follows the link is *not* blocked — they can query site +event logs and retry/discard parked messages. The authorization attribute and +the nav visibility both contradict the design. + +**Recommendation** + +Add `[Authorize(Policy = AuthorizationPolicies.RequireDeployment)]` to +`EventLogs.razor` and `ParkedMessages.razor`, and move their nav links into a +`` block (consistent with the Topology +/ Deployments / Debug View links). Confirm Health Dashboard is intentionally +all-roles (it is, per the design). + +**Resolution** + +_Unresolved._ + +### CentralUI-008 — Audit-log date filters treat browser-local datetimes as UTC + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Components/Pages/Monitoring/AuditLog.razor:242-243` | + +**Description** + +The `From`/`To` filters bind `` to `DateTime?` +fields. A `datetime-local` input yields the value the user typed in their +*browser-local* time zone. `FetchPage` converts them with +`new DateTimeOffset(_filterFrom.Value, TimeSpan.Zero)` — i.e. it labels the +local wall-clock value as UTC. For any non-UTC user the audit query window is +shifted by their UTC offset, silently returning the wrong rows. CLAUDE.md +mandates UTC throughout, but that requires converting the local input *to* UTC, +not relabelling it. + +**Recommendation** + +Convert the picked local time to UTC before querying — capture the browser +offset (JS interop) and apply it, or document the inputs as UTC and label them +in the UI. The same issue should be checked in `EventLogs.razor` if it has +time-range filters. + +**Resolution** + +_Unresolved._ + +### CentralUI-009 — `DebugView` stream callbacks touch a possibly-disposed `ToastNotification` + +| | | +|--|--| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Components/Pages/Deployment/DebugView.razor:400-409,538-544` | + +**Description** + +The `onTerminated` callback passed to `DebugStreamService.StartStreamAsync` +captures `_toast` and `this` and runs on an Akka/gRPC thread. If the user +navigates away, `Dispose()` calls `StopStream`, but a stream-termination event +already in flight can still invoke `onTerminated`, which calls +`_toast.ShowError(...)` and `StateHasChanged()` on a disposed component. The +component does not guard callbacks with a disposed flag or a +`CancellationTokenSource`. The same applies to the `onEvent` callbacks at +lines 391-398 that call `InvokeAsync(StateHasChanged)`. + +**Recommendation** + +Track a `_disposed`/`CancellationTokenSource` on the component, check it at the +top of every stream callback, and stop the stream synchronously before marking +disposed. `InvokeAsync` after disposal throws `ObjectDisposedException`; the +callbacks should no-op once disposed. + +**Resolution** + +_Unresolved._ + +### CentralUI-010 — `ToastNotification` auto-dismiss continuation runs after component disposal + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Components/Shared/ToastNotification.razor:62-71,90` | + +**Description** + +`AddToast` schedules `Task.Delay(dismissMs).ContinueWith(...)` with the result +discarded (`_ =`). The continuation calls `InvokeAsync(StateHasChanged)`. If the +host page is disposed before the 5-second delay elapses (common — navigate away +right after an action), the continuation runs against a disposed component and +`InvokeAsync` throws `ObjectDisposedException` on a thread-pool thread with no +catch, producing an unobserved task exception. `Dispose()` is an empty body and +cancels nothing. + +**Recommendation** + +Hold a `CancellationTokenSource`, pass its token to `Task.Delay`, cancel it in +`Dispose()`, and guard the continuation. Alternatively wrap the continuation +body in a try/catch for `ObjectDisposedException`. + +**Resolution** + +_Unresolved._ + +### CentralUI-011 — `DiffDialog` leaves a dangling `TaskCompletionSource` when disposed while open + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Components/Shared/DiffDialog.razor:89-95,151-157` | + +**Description** + +`OpenAsync` creates `_tcs` and returns `_tcs.Task` to the caller, which +typically `await`s it. The task is completed only by `Close()`. If the user +navigates away while the dialog is open, `DisposeAsync` runs but never completes +`_tcs`, so the awaiting caller's continuation never resumes — a permanently +suspended `Task` (and any `using`/cleanup after the await is skipped). The +`IDialogService.Confirm/Prompt` path has the same shape but at least its host +is a single long-lived `DialogHost`; `DiffDialog` is per-page. + +**Recommendation** + +In `DisposeAsync`, call `_tcs?.TrySetResult(false)` (or `TrySetCanceled`) so any +awaiter completes deterministically. + +**Resolution** + +_Unresolved._ + +### CentralUI-012 — N+1 query loading data connections for the Sites page + +| | | +|--|--| +| Severity | Medium | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Components/Pages/Admin/Sites.razor:196-205` | + +**Description** + +`LoadDataAsync` fetches all sites, then issues +`SiteRepository.GetDataConnectionsBySiteIdAsync(site.Id)` once per site in a +loop. With N sites this is N+1 database round-trips on every page load and every +post-delete refresh. The connection lists are only used for a small per-card +summary. + +**Recommendation** + +Add a repository method that returns all data connections (or connections for a +set of site ids) in one query and group them client-side, or project the small +summary in a single query. + +**Resolution** + +_Unresolved._ + +### CentralUI-013 — `ScriptAnalysisService` blocks on async shared-script lookups + +| | | +|--|--| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/ScriptAnalysis/ScriptAnalysisService.cs:951-952` | + +**Description** + +`ResolveCalledShape` calls `_sharedScripts.GetShapesAsync().GetAwaiter().GetResult()` +to resolve a shared-script shape synchronously. `GetShapesAsync` ultimately hits +`SharedScriptService` and its EF Core repository. Sync-over-async on a request +thread risks thread-pool starvation under load and can deadlock if any awaited +continuation needs a captured context. `Hover` and `SignatureHelp` (which call +`ResolveCalledShape`) are themselves synchronous methods, so the blocking call +is structural. + +**Recommendation** + +Make `Hover` and `SignatureHelp` async and `await` `GetShapesAsync`, or have the +catalog expose a cached synchronous snapshot that is refreshed asynchronously. +The `IMemoryCache` is already present — caching the shapes there and reading +them synchronously would remove the blocking call. + +**Resolution** + +_Unresolved._ + +### CentralUI-014 — Test Run side effects (HTTP/SQL/SMTP) fire against production services + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/ScriptAnalysis/ScriptAnalysisService.cs:254-259`; `src/ScadaLink.CentralUI/ScriptAnalysis/SandboxHostHelpers.cs:26-117` | + +**Description** + +By design (documented in the XML comments) Test Run wires `ExternalSystem`, +`Database`, and `Notify` to central's *real* `IExternalSystemClient`, +`IDatabaseGateway`, and `INotificationDeliveryService`, so a Test Run that calls +`Notify.To(...).Send(...)` actually emails recipients, `Database.Connection(...)` +opens a real DB connection, and `External.Call(...)` makes real HTTP calls — +with production-equivalent side effects. There is no dry-run mode, no +confirmation, and (combined with CentralUI-001) no restriction on what a script +can do. A Design user testing a draft script can dispatch real notifications or +mutate external databases. The behaviour is intentional but the blast radius is +not surfaced to the user. + +**Recommendation** + +At minimum, surface a clear warning in the Test Run UI that side effects are +real, and require explicit opt-in for side-effecting calls. Preferably offer a +dry-run mode that stubs the helpers, defaulting to dry-run. + +**Resolution** + +_Unresolved._ + +### CentralUI-015 — `DialogService` continuations resolve off the render thread + +| | | +|--|--| +| Severity | Low | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/ServiceCollectionExtensions.cs:24`; `src/ScadaLink.CentralUI/Components/Shared/DialogService.cs:18-69` | + +**Description** + +`DialogService` is `AddScoped` (one per circuit, correct) but +`ConfirmAsync`/`PromptAsync` complete via `ContinueWith(..., TaskScheduler.Default)`, +so a caller awaiting them resumes on a thread-pool thread. Any subsequent +component state mutation by the caller is then off the render thread unless the +caller wraps it in `InvokeAsync`. Call sites are not consistently doing so, +which can produce non-deterministic render glitches. + +**Recommendation** + +Either resolve continuations on the circuit's sync context or document that +callers must `InvokeAsync` after awaiting `ConfirmAsync`/`PromptAsync`. Audit +call sites for off-thread state mutation. + +**Resolution** + +_Unresolved._ + +### CentralUI-016 — Pagers render one button per page with no windowing + +| | | +|--|--| +| Severity | Low | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Components/Shared/DataTable.razor:62-68`; `src/ScadaLink.CentralUI/Components/Pages/Deployment/Deployments.razor:167-173` | + +**Description** + +The `DataTable` and `Deployments` paginators loop `for i = 1..totalPages` and +emit a `
  • ` button for every page. With a few thousand records at page size 25 +that is hundreds of buttons rendered into the diff on every state change. It is +not a correctness bug but degrades render performance and usability on large +datasets. + +**Recommendation** + +Window the pager (first / prev / a few around current / next / last) or switch +large lists to a "load more" / numeric jump input. + +**Resolution** + +_Unresolved._ + +### CentralUI-017 — `/auth/logout` POST disables antiforgery, enabling logout CSRF + +| | | +|--|--| +| Severity | Low | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Auth/AuthEndpoints.cs:127-138` | + +**Description** + +The `POST /auth/logout` endpoint calls `.DisableAntiforgery()`, and a plain +`GET /logout` endpoint also signs the user out. Either can be triggered +cross-site (an `` or an auto-submitting form) to forcibly log +a user out. Login itself reasonably disables antiforgery (pre-auth), but logout +is a state-changing authenticated action and should be CSRF-protected. + +**Recommendation** + +Require an antiforgery token on `POST /auth/logout` (the `NavMenu` sign-out form +can include the antiforgery token), and remove or protect the state-changing +`GET /logout` route. + +**Resolution** + +_Unresolved._ + +### CentralUI-018 — Broad `catch {}` blocks swallow JS interop and storage errors silently + +| | | +|--|--| +| Severity | Low | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.CentralUI/Components/Shared/MonacoEditor.razor:116-118,123,142,164,170,176,182,189`; `src/ScadaLink.CentralUI/Components/Shared/TreeView.razor:129,139`; `src/ScadaLink.CentralUI/Components/Pages/Admin/Sites.razor:316-319` | + +**Description** + +Numerous `try { ... } catch { }` blocks swallow every exception with no logging. +The prerender-time JS-unavailable case is legitimate, but these catches also +hide real failures: a genuine Monaco init failure, or a clipboard permission +error become invisible. In `TreeView.razor` the storage-restore +`JsonSerializer.Deserialize` (line 139) is not inside a try at all and would +throw uncaught on a corrupt `treeviewStorage` payload. Debugging UI issues in +production is then guesswork. + +**Recommendation** + +Catch the specific expected exception type (e.g. `JSDisconnectedException`, +`InvalidOperationException` during prerender) and log anything else via +`ILogger`. Wrap the TreeView storage `Deserialize` in its own guarded block. + +**Resolution** + +_Unresolved._ + +### CentralUI-019 — Sparse unit-test coverage for a large module; critical paths untested + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.CentralUI.Tests/` | + +**Description** + +The module has ~65 source files but unit tests cover only the script analyzer, +TreeView, schema model, and two data-connection pages. Untested critical paths +include: the auth bridge (`CookieAuthenticationStateProvider`, +`AuthEndpoints`), `RunInSandboxAsync` (timeout, recursion limit, error +classification, side-effect wiring), `DialogService` resolution semantics, +`DebugView` stream lifecycle and the `UpsertWithCap` cap logic, `Health` and +`Deployments` timer behaviour, and `SchemaBuilderModel` round-tripping of nested +schemas. Given findings CentralUI-001/003/009/010 sit on untested code, the gap +is material. The Playwright suite covers login and navigation only. + +**Recommendation** + +Add bUnit/unit tests for the auth bridge, sandbox-run behaviour (including +forbidden-API rejection once CentralUI-001 is fixed), dialog resolution, and the +DebugView cap/lifecycle logic. Prioritise the paths named in the Critical/High +findings. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/ClusterInfrastructure/findings.md b/code-reviews/ClusterInfrastructure/findings.md new file mode 100644 index 0000000..16a7087 --- /dev/null +++ b/code-reviews/ClusterInfrastructure/findings.md @@ -0,0 +1,313 @@ +# Code Review — ClusterInfrastructure + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.ClusterInfrastructure` | +| Design doc | `docs/requirements/Component-ClusterInfrastructure.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 8 | + +## Summary + +The ClusterInfrastructure module is currently a **Phase 0 skeleton**. It contains +only two source files: `ClusterOptions.cs`, a plain options POCO, and +`ServiceCollectionExtensions.cs`, whose two registration methods are explicit no-ops. +None of the responsibilities described in `Component-ClusterInfrastructure.md` — +Akka.NET cluster bootstrap, leader election, failover detection, split-brain +resolution, cluster singleton hosting, Windows service lifecycle — are implemented. +There are therefore no correctness, concurrency, or Akka-convention defects to find +in *behaviour*, because there is no behaviour. The findings below instead concern +(a) the large gap between the design doc and the code, (b) the options class missing +the validation, configuration-binding affordances, and coverage of documented +settings that peer modules provide, and (c) the no-op DI extensions silently +returning success, which is a latent reliability hazard once the Host wires this +module in. The dominant theme is **incompleteness**: this module is the foundation +every other component runs on, yet it presently delivers nothing the design requires. +The single options class is clean and its test covers defaults and setters +adequately for what exists. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ✓ | No executable logic exists beyond an options POCO; no logic bugs, but `ServiceCollectionExtensions` returns success while doing nothing (CI-002). | +| 2 | Akka.NET conventions | ✓ | No actors, no `ActorSystem` bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against `Tell`/`Ask`, immutability, or `PipeTo`. | +| 3 | Concurrency & thread safety | ✓ | No shared mutable state, no actors, no async code. No issues found in current code. | +| 4 | Error handling & resilience | ✓ | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. | +| 5 | Security | ✓ | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. | +| 6 | Performance & resource management | ✓ | No streams, connections, timers, or `IDisposable` resources exist yet. No issues found in current code. | +| 7 | Design-document adherence | ✓ | Severe drift: the module implements none of its documented responsibilities (CI-001). `ClusterOptions` also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and `down-if-alone` (CI-003). | +| 8 | Code organization & conventions | ✓ | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing `IValidateOptions`/data-annotation validation (CI-004) versus the Options pattern intent. | +| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). | +| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). | + +## Findings + +### ClusterInfrastructure-001 — Module implements none of its documented responsibilities + +| | | +|--|--| +| Severity | High | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:9`, `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:16` | + +**Description** + +`Component-ClusterInfrastructure.md` assigns this module seven concrete +responsibilities: bootstrap the Akka.NET `ActorSystem`, form the two-node cluster, +manage leader election / active-standby role assignment, detect node failures and +trigger failover, provide remoting, host the cluster singleton, and manage the +Windows service lifecycle. The entire module is two files: a `ClusterOptions` POCO +and a `ServiceCollectionExtensions` whose methods are explicitly commented +`// Phase 0: skeleton only` and `// Phase 0: placeholder for Akka actor registration` +and simply return the unmodified `IServiceCollection`. There is no `Akka.Cluster`, +`Akka.Cluster.Tools`, `Akka.Remote`, or split-brain-resolver dependency in the +`.csproj` at all (it references only `Microsoft.Extensions.DependencyInjection.Abstractions`, +`Microsoft.Extensions.Options`, and `ScadaLink.Commons`). Because every other +ScadaLink component runs inside the actor system this module is responsible for +creating, the absence of any implementation blocks the foundational layer of the +system. + +**Recommendation** + +Track the gap explicitly (a milestone/issue) and implement the documented behaviour: +add the Akka cluster/remote/cluster-tools and split-brain-resolver package +references, build the cluster bootstrap (HOCON generation from `ClusterOptions`), +the split-brain resolver configuration, cluster-singleton hosting support, and +`CoordinatedShutdown` wiring. Until then, the module's `Status` and the design doc +should clearly state it is unimplemented so callers do not assume otherwise. + +**Resolution** + +_Unresolved._ + +### ClusterInfrastructure-002 — No-op DI extension methods report success while doing nothing + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:7-17` | + +**Description** + +`AddClusterInfrastructure` and `AddClusterInfrastructureActors` both accept an +`IServiceCollection` and return it unchanged. A caller (e.g. the Host) that invokes +`services.AddClusterInfrastructure()` receives a fluent, success-looking result but +no actor system, no cluster, and no singleton support is actually registered. This +is a silent failure: the system will appear to start, then fail later and far from +the cause (e.g. when a component resolves an `ActorSystem` that was never added, or +when the cluster singleton never forms). A no-op that masquerades as a completed +registration is worse than an unimplemented method that throws. + +**Recommendation** + +Until the real implementation exists, make the placeholder loud rather than silent — +either throw `NotImplementedException` from the methods, or have them log a +prominent warning, so an integrating caller fails fast with a clear cause. Replace +with the genuine registration when CI-001 is addressed. + +**Resolution** + +_Unresolved._ + +### ClusterInfrastructure-003 — ClusterOptions omits several documented node-configuration settings + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11` | + +**Description** + +The "Node Configuration", "Split-Brain Resolution", and "Failure Detection Timing" +sections of the design doc enumerate the settings each node needs. `ClusterOptions` +exposes `SeedNodes`, `SplitBrainResolverStrategy`, `StableAfter`, +`HeartbeatInterval`, `FailureDetectionThreshold`, and `MinNrOfMembers`, but is +missing: the Akka remoting hostname/port (default 8081 central, 8082 site), the +cluster role (Central vs. Site) and the site identifier, the `down-if-alone` flag +(the design explicitly requires `down-if-alone = on` for the keep-oldest resolver), +and — for site nodes — the gRPC port (default 8083) and local SQLite storage paths. +Without these, the options class cannot drive a correct HOCON configuration when +CI-001 is implemented. (Some settings such as remoting host/port may instead belong +in `Host/NodeOptions.cs`; the split of ownership should be decided deliberately, but +at minimum `down-if-alone` belongs with the split-brain settings here.) + +**Recommendation** + +Add the missing settings — at minimum a `DownIfAlone` boolean (default `true`) and +the cluster role / site identifier — or document explicitly which settings are +owned by `Host/NodeOptions.cs` instead, so the design doc and the options classes +agree on where each value lives. + +**Resolution** + +_Unresolved._ + +### ClusterInfrastructure-004 — ClusterOptions has no validation despite safety-critical values + +| | | +|--|--| +| Severity | Medium | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11` | + +**Description** + +`ClusterOptions` carries values whose misconfiguration has cluster-wide +consequences. The design doc is emphatic that `min-nr-of-members` must be `1` (a +value of `2` blocks the singleton and therefore all data collection indefinitely +after failover), that `SplitBrainResolverStrategy` must be `keep-oldest` for a +two-node cluster (quorum strategies cause total shutdown), and that the timing +values are interdependent (`HeartbeatInterval` must be well below +`FailureDetectionThreshold`). The class has no data annotations, no +`IValidateOptions`, and no guard logic, so an `appsettings.json` +setting `MinNrOfMembers: 2` or `SplitBrainResolverStrategy: "keep-majority"` (the +exact value the test at `ClusterOptionsTests.cs:35` shows is settable) would be +accepted silently and produce the catastrophic outcomes the design doc warns +against. + +**Recommendation** + +Add validation — data annotations (`[Range]` for `MinNrOfMembers`, etc.) plus an +`IValidateOptions` implementation that enforces +`MinNrOfMembers == 1`, restricts `SplitBrainResolverStrategy` to a known set, +requires `SeedNodes` non-empty, and asserts `HeartbeatInterval < +FailureDetectionThreshold` and positive `StableAfter`. Register it with +`ValidateOnStart()` so misconfiguration fails fast at boot. + +**Resolution** + +_Unresolved._ + +### ClusterInfrastructure-005 — No configuration section name constant for the Options pattern binding + +| | | +|--|--| +| Severity | Low | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3` | + +**Description** + +CLAUDE.md specifies per-component configuration via `appsettings.json` sections +bound with the Options pattern. `ClusterOptions` provides no `public const string +SectionName` (or equivalent) for the binding site to reference, so whichever code +binds the section must hard-code the magic string, and there is no single source of +truth for the section name. Because `AddClusterInfrastructure` is itself a no-op +(CI-002), the options class is currently bound nowhere at all, making the missing +constant easy to overlook. + +**Recommendation** + +Add a `public const string SectionName = "Cluster";` (or the agreed name) to +`ClusterOptions` and have the eventual `AddClusterInfrastructure` bind +`configuration.GetSection(ClusterOptions.SectionName)` against it. + +**Resolution** + +_Unresolved._ + +### ClusterInfrastructure-006 — No tests for any cluster behaviour; only the options POCO is covered + +| | | +|--|--| +| Severity | Medium | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.ClusterInfrastructure.Tests/ClusterOptionsTests.cs:1-51` | + +**Description** + +The test project contains only `ClusterOptionsTests`, exercising default values and +property setters of `ClusterOptions`. There are no tests for cluster formation, +leader election, failover detection, split-brain resolution, singleton handover, or +the `ServiceCollectionExtensions` registration methods — none can exist because the +behaviour itself is absent (CI-001). This is recorded so the testing gap is tracked +alongside the implementation gap: the most safety-critical paths of the entire +system (failover, split-brain, dual-node recovery) are completely untested. The +test at line 30-50 also asserts that `SplitBrainResolverStrategy` can be set to +`"keep-majority"`, implicitly endorsing a value the design doc forbids for a +two-node cluster — see CI-004. + +**Recommendation** + +When CI-001 is implemented, add multi-node `Akka.Cluster.TestKit` / +`MultiNodeTestKit` tests covering cluster formation, failover promotion, +split-brain downing, and singleton handover, plus unit tests for HOCON generation +from `ClusterOptions` and for the options validation from CI-004. + +**Resolution** + +_Unresolved._ + +### ClusterInfrastructure-007 — ClusterOptions lacks XML documentation comments + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11` | + +**Description** + +`ClusterOptions` and each of its six properties have no XML doc comments. Peer +options classes such as `StoreAndForward/StoreAndForwardOptions.cs` document the +class and every property (including units and design-doc references). For a class +whose values carry the cluster-wide consequences described in the design doc +(notably `MinNrOfMembers` and `SplitBrainResolverStrategy`), the absence of inline +documentation is a maintainability and safety gap — a future editor has no in-code +warning that `MinNrOfMembers` must stay `1`. + +**Recommendation** + +Add `` comments to the class and each property, stating units and the +documented constraints (e.g. that `MinNrOfMembers` must be `1`, that +`HeartbeatInterval` must be well below `FailureDetectionThreshold`), referencing +the relevant design-doc sections as peer modules do. + +**Resolution** + +_Unresolved._ + +### ClusterInfrastructure-008 — "Phase 0 skeleton" status is undocumented at the module level + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:9`, `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:16` | + +**Description** + +The only indication that this foundational module is unimplemented is two inline +comments inside private method bodies (`// Phase 0: skeleton only` / +`// Phase 0: placeholder for Akka actor registration`). There is no module README, +no `` in the design doc, and no tracking marker visible to anyone +reading the project structure or the component table. Given that the design doc +(`Component-ClusterInfrastructure.md`) describes a fully featured component with no +caveat, a reader will reasonably assume the module is built. The mismatch between a +complete-looking design doc and an empty implementation is itself a documentation +defect. + +**Recommendation** + +Add a short note to the design doc (or a module-level `README.md`) stating the +current implementation status and what "Phase 0" delivers, and reference a tracked +issue for the remaining work (CI-001). Keep the README component table accurate +about which components are skeletons versus implemented. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/Commons/findings.md b/code-reviews/Commons/findings.md new file mode 100644 index 0000000..f6e6f56 --- /dev/null +++ b/code-reviews/Commons/findings.md @@ -0,0 +1,448 @@ +# Code Review — Commons + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.Commons` | +| Design doc | `docs/requirements/Component-Commons.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 12 | + +## Summary + +Commons is in good overall health. It is a well-organized, dependency-light library: +the architectural-constraint tests enforce the no-Akka/no-EF/no-ASP.NET rule, the +POCO-entity and message-as-record conventions, and the UTC timestamp rule. The folder +and namespace hierarchy closely matches REQ-COM-5b. No Critical issues were found. + +The findings cluster around three themes. First, a handful of files quietly stretch +the REQ-COM-6 "no business logic" boundary — `StaleTagMonitor`, `OpcUaEndpointConfigSerializer`, +`OpcUaEndpointConfigValidator`, `ScriptParameters`, `ValueFormatter`, `DynamicJsonElement` +and `ScriptArgs` all carry non-trivial behavior, and a couple have real correctness or +concurrency defects (the `StaleTagMonitor` stale-fire race, the `DynamicJsonElement` +`JsonDocument`-lifetime hazard, the silent conversion-failure swallowing in +`ScriptParameters.GetNullable`). Second, the `ManagementCommandRegistry` name mapping is +asymmetric and namespace-scoped in a way that does not match the broader set of +`*Command` records elsewhere in `Messages/`. Third, several behavior-bearing types +(`ValueFormatter`, `DynamicJsonElement`, `ScriptArgs`, `ManagementCommandRegistry`, +`Result`, the OPC UA serializer round-trip) have no unit tests despite containing the +kind of edge-case logic that warrants them. Entity and message contracts otherwise look +clean and additive-evolution-friendly, with the exception of one `ValueTuple` use in a +wire command. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ✓ | `DynamicJsonElement.TryConvert` returns success for non-convertible types; `Result` allows null error; legacy-config fallback loses data. | +| 2 | Akka.NET conventions | ✓ | Commons has no actors (correct). Message contracts are records and immutable. One wire message uses `ValueTuple` (Commons-008). Correlation IDs present on request/response messages. | +| 3 | Concurrency & thread safety | ✓ | `StaleTagMonitor` has a check-then-act race between the timer callback and `OnValueReceived` (Commons-001). | +| 4 | Error handling & resilience | ✓ | `ScriptParameters.GetNullable` silently swallows conversion failures (Commons-003); OPC UA legacy deserialize discards malformed input (Commons-005). | +| 5 | Security | ✓ | No auth logic here. `SmtpConfiguration.Credentials` / OPC UA passwords are plain-string fields (storage/encryption is a consumer concern) — noted, not a finding. No script-trust violations: Commons defines no forbidden-API surface. | +| 6 | Performance & resource management | ✓ | `StaleTagMonitor` disposes its `Timer` correctly. `DynamicJsonElement` references a `JsonElement` whose backing document lifetime is not owned (Commons-002). | +| 7 | Design-document adherence | ✓ | Several behavior-bearing helper/validator/serializer classes push against REQ-COM-6 "no business logic" (Commons-007). Folder layout matches REQ-COM-5b. | +| 8 | Code organization & conventions | ✓ | `ManagementCommandRegistry` naming is asymmetric/namespace-scoped (Commons-004). `DeployedConfigSnapshot`, `InstanceAlarmOverride`, `TemplateFolder`, `ISiteRepository`, several service interfaces and `Messages/Management` exist but are not listed in Component-Commons.md (Commons-009). | +| 9 | Testing coverage | ✓ | `ValueFormatter`, `DynamicJsonElement`, `ScriptArgs`, `ManagementCommandRegistry`, `Result`, `ConfigurationDiff`, `AlarmContext`, and the OPC UA serializer round-trip have no tests (Commons-010). | +| 10 | Documentation & comments | ✓ | `OpcUaEndpointConfigSerializer.Deserialize` XML doc does not mention the silent data-loss path (Commons-005). `Component-Commons.md` is stale relative to the actual file set (Commons-009). `ValueFormatter` uses current-culture formatting without documenting it (Commons-012). | + +## Findings + +### Commons-001 — `StaleTagMonitor` stale-fire race between timer and `OnValueReceived` + +| | | +|--|--| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.Commons/Types/StaleTagMonitor.cs:42-46`, `:62-67` | + +**Description** + +`OnValueReceived` sets `_staleFired = false` then calls `_timer.Change(...)`, while the +timer callback `OnTimerElapsed` reads `_staleFired`, sets it to `true`, and invokes the +`Stale` event. `_staleFired` is `volatile`, which guarantees visibility but not +atomicity of the check-then-set. The two methods run on different threads (a value- +arrival thread and a `ThreadPool` timer thread). If the timer callback has already +passed the `if (_staleFired) return;` check when `OnValueReceived` runs, `Stale` fires +even though a fresh value just arrived — a spurious staleness signal. There is also a +window where `OnValueReceived` resets `_staleFired` and reschedules the timer while a +callback for the previous period is mid-flight, so `Stale` can fire once per period as +documented but at the wrong moment. For a heartbeat monitor feeding connection-health +decisions, a false stale signal can trigger an unnecessary reconnect. + +**Recommendation** + +Guard the state transition with a lock, or replace the `_staleFired` bool with an +`Interlocked.CompareExchange` on an `int` so only one of "fire" / "reset" wins. The +callback should atomically test-and-set; `OnValueReceived` should atomically reset and +only then reschedule the timer. + +**Resolution** + +_Unresolved._ + +### Commons-002 — `DynamicJsonElement` retains a `JsonElement` whose `JsonDocument` lifetime it does not own + +| | | +|--|--| +| Severity | Medium | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.Commons/Types/DynamicJsonElement.cs:10-17` | + +**Description** + +`DynamicJsonElement` stores a `JsonElement` and exposes it for deferred, dynamic access +from scripts. A `JsonElement` is only valid while the `JsonDocument` that produced it has +not been disposed; accessing a `JsonElement` after its document is disposed throws +`ObjectDisposedException`. Nothing in `DynamicJsonElement` keeps the document alive or +documents that the caller must. Because the wrapper is explicitly designed for +"convenient property access in scripts" — i.e. access at an arbitrary later time — a +caller that wraps an element from a `using var doc = JsonDocument.Parse(...)` block (the +exact pattern used in `OpcUaEndpointConfigSerializer`) will hand scripts a wrapper that +faults on first member access. + +**Recommendation** + +Either clone the element on construction with `JsonElement.Clone()` (which detaches it +from the document and makes it safe to retain), or hold a reference to the owning +`JsonDocument` and implement `IDisposable`. Document the lifetime contract on the type +regardless. + +**Resolution** + +_Unresolved._ + +### Commons-003 — `ScriptParameters.GetNullable` silently swallows conversion failures + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.Commons/Types/ScriptParameters.cs:72-86` | + +**Description** + +`GetNullable` catches `ScriptParameterException` from `ConvertScalar` and returns +`default!` (null) "on conversion failure for nullable". This conflates two distinct +cases: a parameter that is genuinely absent/null, and a parameter that is *present but +holds an unconvertible value* (e.g. `Get("count")` when `count` is the string +`"banana"`). The latter is almost always a script or caller bug, and silently mapping it +to `null` hides it — the script then proceeds with a null it interprets as "not +supplied". The non-nullable `Get` and the array/list paths correctly throw with a +descriptive message for the same bad input, so the behavior is also inconsistent across +the API surface. The XML doc states "returns null if missing, null, or unconvertible", +so the behavior is intentional, but it remains a footgun. + +**Recommendation** + +Distinguish "absent/null" from "present but unconvertible": return null only for the +former and throw `ScriptParameterException` for the latter, mirroring the array/list +element handling. If the swallowing must stay for compatibility, at minimum surface it +(e.g. an out-of-band warning) rather than failing silently. + +**Resolution** + +_Unresolved._ + +### Commons-004 — `ManagementCommandRegistry` name mapping is asymmetric and namespace-scoped + +| | | +|--|--| +| Severity | Medium | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.Commons/Messages/Management/ManagementCommandRegistry.cs:14-35` | + +**Description** + +`BuildRegistry` registers only types in the exact `ScadaLink.Commons.Messages.Management` +namespace whose names end in `Command`. `GetCommandName(Type)`, however, strips a +`Command` suffix from *any* type passed to it. The two halves disagree: + +- `GetCommandName` will happily compute a command name for `*Command` records that live + in other `Messages/` sub-namespaces (`DeployInstanceCommand` in `Messages.Deployment`, + `DisableInstanceCommand` in `Messages.Lifecycle`, `SetStaticAttributeCommand` in + `Messages.Instance`, `DeployArtifactsCommand` in `Messages.Artifacts`, etc.), yet + `Resolve` will return `null` for every one of those names because they were never + registered. +- Because of this gap the Management namespace carries deliberately renamed duplicates + (`MgmtDeployInstanceCommand`, `MgmtEnableInstanceCommand`, `MgmtDisableInstanceCommand`, + `MgmtDeleteInstanceCommand` in `InstanceCommands.cs`) whose `Mgmt` prefix exists only + to dodge a collision the registry's namespace filter already prevents — a confusing, + undocumented coupling. + +A round-trip `Resolve(GetCommandName(t))` is therefore not guaranteed to return `t`, +which is the implicit contract of a name registry. + +**Recommendation** + +Make the two methods symmetric: either scan all of `Messages/` (and detect/throw on +duplicate stripped names, since `ToFrozenDictionary` will throw on a collision) or +restrict `GetCommandName` to types the registry actually contains. Document the chosen +scope, and reconsider whether the `Mgmt*` prefixed duplicates are still needed. + +**Resolution** + +_Unresolved._ + +### Commons-005 — `OpcUaEndpointConfigSerializer.Deserialize` discards malformed legacy input and over-reports `IsLegacy` + +| | | +|--|--| +| Severity | Low | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.Commons/Serialization/OpcUaEndpointConfigSerializer.cs:25-51` | + +**Description** + +When the typed-deserialize path fails or the JSON lacks `endpointUrl`, `Deserialize` +falls through to `LoadLegacy`. If `LoadLegacy` itself throws `JsonException` (genuinely +malformed JSON), the method returns `(new OpcUaEndpointConfig(), IsLegacy: true)` — a +default, empty config with the legacy flag set. The original stored string is silently +discarded, and the caller is told it is a recoverable "legacy" row when in fact the data +was unparseable. A form built on the documented `IsLegacy` contract ("prompt the user to +re-save") will present an empty config as if it were the user's saved configuration, +inviting them to overwrite real (if malformed) data with blanks. The XML doc only +describes the happy legacy path and does not mention this data-loss branch. + +**Recommendation** + +Distinguish "parsed as legacy" from "could not parse at all" — e.g. return a third state +or throw for genuinely malformed input so the caller can surface an error instead of an +empty form. Update the XML doc to describe the failure branch. + +**Resolution** + +_Unresolved._ + +### Commons-006 — `DynamicJsonElement.TryConvert` reports success for unconvertible target types + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.Commons/Types/DynamicJsonElement.cs:47-51`, `:66-76` | + +**Description** + +`TryConvert` does `result = ConvertTo(binder.Type); return result != null || binder.Type == typeof(object);`. +`ConvertTo` returns `null` for any type/kind pair it does not handle (e.g. requesting +`int` from a JSON string, or `DateTime` from anything). For a non-`object` target this +yields `result == null` and `return false`, which is correct. But the `|| binder.Type == typeof(object)` +clause makes `(object)dynamicElement` succeed with a `null` result even when the wrapped +element is, say, a JSON object or a non-null string — the cast silently produces `null` +instead of the element or its value. Any script doing `object o = jsonThing;` gets `null` +for a present value. The conversion of a present, non-null JSON value should never yield +`null`. + +**Recommendation** + +For the `object` target, return the element itself (or `Wrap(_element)`) rather than +`null`. Only return `null` when the wrapped element is genuinely `JsonValueKind.Null`. + +**Resolution** + +_Unresolved._ + +### Commons-007 — Several Commons types carry non-trivial logic, stretching REQ-COM-6 + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.Commons/Types/ScriptParameters.cs`, `src/ScadaLink.Commons/Serialization/OpcUaEndpointConfigSerializer.cs`, `src/ScadaLink.Commons/Validators/OpcUaEndpointConfigValidator.cs`, `src/ScadaLink.Commons/Types/StaleTagMonitor.cs`, `src/ScadaLink.Commons/Types/ScriptArgs.cs` | + +**Description** + +REQ-COM-6 states Commons "must contain only data structures, interfaces, enums, and +constants" and "must not contain any business logic", with method bodies "limited to +trivial data-access logic". Several files exceed that: `ScriptParameters` performs typed +conversion with reflection and JSON-element unwrapping; `OpcUaEndpointConfigSerializer` +implements a multi-shape (typed + legacy flat-dict) serialization strategy; +`OpcUaEndpointConfigValidator` encodes OPC UA domain rules (e.g. `LifetimeCount` ≥ 3× +`KeepAliveCount`); `StaleTagMonitor` runs a `Timer` and raises events; `ScriptArgs` +reflects over arbitrary objects. The `ArchitecturalConstraintTests` "no service/actor" +heuristic only counts public methods (> 3) and so does not catch these. This is design +drift, not a defect — but it should be a deliberate decision: either move these helpers +into the components that own the behavior (Data Connection Layer, Site Runtime, +Template Engine) or amend Component-Commons.md to explicitly permit "pure stateless +helpers/validators". + +**Recommendation** + +Decide and document the policy. If these are intentionally allowed in Commons, add a +sentence to REQ-COM-6 carving out pure validators/serializers/parsers; otherwise relocate +them. Tighten the architectural test if the rule is meant to be enforced. + +**Resolution** + +_Unresolved._ + +### Commons-008 — `SetConnectionBindingsCommand` uses `ValueTuple` in a wire message contract + +| | | +|--|--| +| Severity | Low | +| Category | Akka.NET conventions | +| Status | Open | +| Location | `src/ScadaLink.Commons/Messages/Management/InstanceCommands.cs:10` | + +**Description** + +`SetConnectionBindingsCommand` declares +`IReadOnlyList<(string AttributeName, int DataConnectionId)> Bindings`. The tuple element +names are compile-time-only; `System.Text.Json` serializes a `ValueTuple` as `Item1` / +`Item2`, and the message is positional with no room for additive evolution (you cannot +add a third field without changing the tuple type, which REQ-COM-5a forbids). Every other +message in `Messages/` uses named records. A management command travels over the +ClusterClient boundary and is exactly the kind of contract REQ-COM-5a's additive-only +rule targets. + +**Recommendation** + +Replace the tuple with a small named record, e.g. +`record ConnectionBinding(string AttributeName, int DataConnectionId)`, and use +`IReadOnlyList`. + +**Resolution** + +_Unresolved._ + +### Commons-009 — `Component-Commons.md` is stale relative to the actual file set + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `docs/requirements/Component-Commons.md:61-198` | + +**Description** + +The design doc's entity list, repository list, and folder tree no longer match the code: + +- Entities present but undocumented: `DeployedConfigSnapshot`, `InstanceAlarmOverride`, + `TemplateFolder`. +- Repository interface present but undocumented: `ISiteRepository` (the doc lists seven + repositories under REQ-COM-4; the code has eight). +- Service interfaces present but undocumented: `IDatabaseGateway`, + `IExternalSystemClient`, `IInstanceLocator`, `INotificationDeliveryService` — REQ-COM-4a + documents only `IAuditService`. +- Whole namespaces absent from the REQ-COM-5b folder tree: `Messages/Management`, + `Messages/DataConnection`, `Messages/Integration`, `Messages/Instance`, + `Messages/RemoteQuery`, plus `Types/DataConnections`, `Types/Scripts`, `Serialization/`, + and `Validators/`. + +CLAUDE.md's editing rules require the design docs to stay in sync with the code; the doc +is now a partial map. + +**Recommendation** + +Refresh Component-Commons.md to enumerate the current entities, repository and service +interfaces, and the actual `Types/`, `Messages/`, `Serialization/`, and `Validators/` +folders. + +**Resolution** + +_Unresolved._ + +### Commons-010 — Behavior-bearing Commons types have no unit tests + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.Commons.Tests/` | + +**Description** + +`ScadaLink.Commons.Tests` covers `Result`, `RetryPolicy`, `ScriptParameters`, +`StaleTagMonitor`, the OPC UA validator, enums, message conventions, compatibility, and +entity conventions. It does not cover several types that contain exactly the kind of +edge-case logic that warrants tests: + +- `ValueFormatter` — scalar vs collection vs null formatting. +- `DynamicJsonElement` — member/index access, conversions, the issues in Commons-002 and + Commons-006 would have been caught by tests. +- `ScriptArgs.Normalize` — dictionary/anonymous-object/primitive-rejection paths. +- `ManagementCommandRegistry` — `Resolve` / `GetCommandName` round-trip (would have + surfaced Commons-004). +- `Result` — `Match`, failure/success accessors, error-on-misuse. +- `OpcUaEndpointConfigSerializer` typed↔flat round-trip and legacy fallback. +- `ConfigurationDiff` / `AlarmContext` / `ScriptScope` — minor, but `HasChanges` / + `HasParent` logic is untested. + +**Recommendation** + +Add focused unit tests for the helper/utility types above, prioritizing +`DynamicJsonElement`, `ScriptArgs`, `ManagementCommandRegistry`, and the OPC UA serializer +round-trip. + +**Resolution** + +_Unresolved._ + +### Commons-011 — `Result.Failure` accepts a null error string + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.Commons/Types/Result.cs:15-20`, `:30-32`, `:36` | + +**Description** + +`Result.Failure(string error)` and the private failure constructor do not validate +`error`. A caller passing `null` produces a failed `Result` whose `Error` getter returns +`null` via `_error!`, and whose `Match` calls `onFailure(_error!)` with `null`. `Result` +is the system-wide error-handling type ("consistent error handling across component +boundaries"); a failed result with no error message defeats its purpose and pushes a +`NullReferenceException` risk onto every consumer that logs or displays `Error`. + +**Recommendation** + +Throw `ArgumentNullException` (or `ArgumentException` for empty/whitespace) in +`Failure`/the failure constructor so a failed `Result` always carries a message. + +**Resolution** + +_Unresolved._ + +### Commons-012 — `ValueFormatter` uses current-culture formatting without documenting it + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.Commons/Types/ValueFormatter.cs:20-27` | + +**Description** + +`FormatDisplayValue` formats `IFormattable` values (and collection elements) with the +parameterless `ToString()`, which uses the current thread culture. The XML doc calls this +"the value's natural string representation" without noting the culture dependency. The +same numeric or `DateTime` attribute value will render differently depending on the +server/UI locale — e.g. decimal separators, date order. CLAUDE.md mandates UTC for +timestamps and notes local-time conversion is "a UI display concern only"; if +`ValueFormatter` is used outside a UI rendering context (e.g. logging, event-log entries, +diff display) the culture-dependent output is inconsistent and a latent bug. + +**Recommendation** + +Decide whether `ValueFormatter` is a UI-only helper. If it can be used outside the UI, +format with `CultureInfo.InvariantCulture` (using the `IFormattable.ToString(null, IFormatProvider)` +overload). Either way, document the culture behavior on the method. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/Communication/findings.md b/code-reviews/Communication/findings.md new file mode 100644 index 0000000..08e473f --- /dev/null +++ b/code-reviews/Communication/findings.md @@ -0,0 +1,404 @@ +# Code Review — Communication + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.Communication` | +| Design doc | `docs/requirements/Component-Communication.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 11 | + +## Summary + +The Communication module is generally well-structured and matches the design doc's +two-transport model (ClusterClient for command/control, gRPC server-streaming for +real-time data). The actors keep mutable state on the actor thread, use `PipeTo` for +async work, and the gRPC server/client lifecycle is mostly disciplined. However the +review found one Critical issue (a `TimeoutException` from `DebugStreamService` leaves +an orphaned bridge actor and an active site-side subscription, leaking resources on +every snapshot timeout) and several High/Medium issues clustered around two themes: +**(a) gRPC subscription bookkeeping races** — `SiteStreamGrpcClient` overwrites and +removes subscription entries by correlation ID without disposal or ownership checks, +so reconnect cycles leak `CancellationTokenSource`es and can cancel the wrong stream; +and **(b) missing supervision strategy** on the coordinator actors, contrary to the +CLAUDE.md "Resume for coordinator actors" decision. Design-doc adherence is otherwise +good. Test coverage is broad for happy paths but has gaps around failover, cache +mutation races, and the snapshot-timeout cleanup path. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ✓ | Snapshot-timeout orphan, reconnect not calling `CleanupGrpc`, subscription-map races. | +| 2 | Akka.NET conventions | ✓ | No supervision strategy on coordinators; `Sender` captured in async-launched closure path. | +| 3 | Concurrency & thread safety | ✓ | `SiteStreamGrpcClient._subscriptions` overwrite/remove race; `_siteClients` field reassignment unused but non-readonly. | +| 4 | Error handling & resilience | ✓ | gRPC reconnect leaks server-side relay; `LoadSiteAddressesFromDb` swallows DB failures silently. | +| 5 | Security | ✓ | No findings in module code. DebugStreamHub auth lives outside this module (Central UI). | +| 6 | Performance & resource management | ✓ | Orphaned subscriptions/CTS leaks; `SiteStreamGrpcClientFactory.Dispose` blocks on async. | +| 7 | Design-document adherence | ✓ | `GrpcMaxStreamLifetime` / keepalive options defined but never applied; hard-coded values used instead. | +| 8 | Code organization & conventions | ✓ | Options pattern correct; minor: public records declared in actor files. No structural issues. | +| 9 | Testing coverage | ✓ | No tests for snapshot-timeout cleanup, address-cache refresh races, or gRPC server reconnect-leak. | +| 10 | Documentation & comments | ✓ | XML comment on `DebugStreamBridgeActor` says "Persistent actor" — it is not an Akka.Persistence actor. | + +## Findings + +### Communication-001 — Snapshot timeout leaves orphaned bridge actor and site subscription + +| | | +|--|--| +| Severity | Critical | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.Communication/DebugStreamService.cs:139`, `src/ScadaLink.Communication/DebugStreamService.cs:149` | + +**Description** + +When `StartStreamAsync` times out waiting for the initial snapshot it calls +`StopStream(sessionId)` and throws. `StopStream` only sends `StopDebugStream` to the +bridge actor **if the session is still in `_sessions`**. But the bridge actor was added +to `_sessions` at line 124 and is only removed by `onTerminatedWrapper`. The serious +case is the race where `onTerminatedWrapper` fires first (e.g. site disconnect arrives +during the wait): `snapshotTcs.TrySetException` completes the await with an +`InvalidOperationException` rather than `OperationCanceledException`, which is **not** +caught by the `catch (OperationCanceledException)` block. The exception propagates +uncaught, `StopStream` is never reached, and if the bridge actor is instead orphaned +(snapshot never arrives, site silent, no terminate) the only cleanup is the 5-minute +`ReceiveTimeout` in the actor — meaning a site-side `StreamRelayActor` and gRPC stream +can stay alive for up to 5 minutes after the central caller has given up. Combined with +the 30s timeout, every transient snapshot delay leaks site resources for minutes. + +**Recommendation** + +In `StartStreamAsync`, wrap the `await` so that *any* failure or cancellation +deterministically calls `StopStream(sessionId)` (e.g. `try/catch (Exception)` or a +`finally` that stops the session when the result was not returned). Ensure +`StopStream` is idempotent and always sends `StopDebugStream` even if the session was +already removed, so the bridge actor (and its site-side subscription) is torn down +promptly rather than waiting for the orphan `ReceiveTimeout`. + +**Resolution** + +_Unresolved._ + +### Communication-002 — gRPC reconnect does not unsubscribe the previous stream, leaking site-side relay actors + +| | | +|--|--| +| Severity | High | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:170`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:143` | + +**Description** + +On a gRPC stream error, `HandleGrpcError` increments the retry count, flips +`_useNodeA`, and schedules `OpenGrpcStream`. `OpenGrpcStream` cancels and disposes +`_grpcCts` and starts a fresh `SubscribeInstance` call — but it never calls +`client.Unsubscribe(_correlationId)` on the *old* node's client, and the site-side +`SiteStreamGrpcServer` keys active streams by `correlation_id` only. Because the new +subscription goes to the *other* node (`_useNodeA` flipped), the old node's +`SiteStreamGrpcServer` still has an active stream + `StreamRelayActor` + +`SiteStreamManager` subscription for that correlation ID. The old node only learns the +client is gone via TCP RST or keepalive — exactly the failure mode that triggered the +reconnect (network partition / silent node), so detection may take ~25s or never. Each +reconnect can therefore leave a zombie relay actor on the failed node. `CleanupGrpc` +(which *does* call `Unsubscribe`) is only invoked on terminal paths, not between +reconnect attempts. + +**Recommendation** + +Before reconnecting in `HandleGrpcError` / at the top of `OpenGrpcStream`, call +`Unsubscribe(_correlationId)` on the client for the *previous* endpoint (the one that +just failed) so the local CTS is cancelled and — where the channel is still alive — +the gRPC cancellation reaches the site and stops the relay actor. + +**Resolution** + +_Unresolved._ + +### Communication-003 — SiteStreamGrpcClient subscription map overwritten without disposal; reconnect can cancel the wrong stream + +| | | +|--|--| +| Severity | High | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:77`, `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:106` | + +**Description** + +`SubscribeAsync` does `_subscriptions[correlationId] = cts;` (line 77), +unconditionally overwriting any existing entry for that correlation ID without +cancelling or disposing the previous `CancellationTokenSource`. The `finally` block +then does `_subscriptions.TryRemove(correlationId, out _)` (line 106) which removes +the entry **by key only, regardless of which CTS is stored**. Because +`DebugStreamBridgeActor` reuses the same `_correlationId` across reconnect attempts +(and `SiteStreamGrpcClientFactory` returns the same `SiteStreamGrpcClient` for a site +even after a node flip), two `SubscribeAsync` calls can briefly share a correlation +ID. The first call's `finally` then removes the *second* call's CTS entry, so a later +`Unsubscribe(correlationId)` finds nothing and the live stream is never cancelled — an +orphan. Conversely the overwritten CTS is leaked (never disposed). + +**Recommendation** + +When inserting, cancel+dispose any prior CTS for that correlation ID. In the `finally`, +remove only if the stored CTS is the one this call created (use the +`TryRemove(KeyValuePair)` overload, mirroring what `SiteStreamGrpcServer` already does +with `StreamEntry`). Consider keying subscriptions by a per-call GUID rather than the +caller-supplied correlation ID. + +**Resolution** + +_Unresolved._ + +### Communication-004 — Coordinator actors declare no SupervisorStrategy (design requires Resume) + +| | | +|--|--| +| Severity | Medium | +| Category | Akka.NET conventions | +| Status | Open | +| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:42`, `src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:22` | + +**Description** + +CLAUDE.md ("Explicit supervision strategies: Resume for coordinator actors, Stop for +short-lived execution actors") requires coordinator actors to use an explicit `Resume` +supervision strategy. `CentralCommunicationActor` and `SiteCommunicationActor` are +long-lived coordinators (they own the per-site ClusterClient map, debug +subscriptions, in-progress deployments) but neither overrides `SupervisorStrategy`. +They fall back to the Akka default (`OneForOneStrategy` with `Restart`). A child fault +— e.g. a `ClusterClient` child of `CentralCommunicationActor` created by +`DefaultSiteClientFactory` — would `Restart` under the default strategy, and any +exception in the coordinator itself would restart it, wiping `_siteClients`, +`_debugSubscriptions`, and `_inProgressDeployments` silently. The design intent is +`Resume` so transient child faults do not discard coordinator state. + +**Recommendation** + +Override `SupervisorStrategy` on both actors to return an explicit +`OneForOneStrategy` with `Directive.Resume` (or the project's standard coordinator +strategy), matching the documented decision and other coordinator actors. + +**Resolution** + +_Unresolved._ + +### Communication-005 — gRPC keepalive and max-stream-lifetime options are defined but never applied + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:25`, `src/ScadaLink.Communication/CommunicationOptions.cs:36` | + +**Description** + +`CommunicationOptions` exposes `GrpcKeepAlivePingDelay`, `GrpcKeepAlivePingTimeout`, +`GrpcMaxStreamLifetime`, and `GrpcMaxConcurrentStreams`, and the design doc's +"gRPC Connection Keepalive" section explicitly states these are configurable. However +`SiteStreamGrpcClient`'s constructor hard-codes `KeepAlivePingDelay = +TimeSpan.FromSeconds(15)` and `KeepAlivePingTimeout = TimeSpan.FromSeconds(10)` +instead of reading the options. `GrpcMaxStreamLifetime` (the documented "Session +timeout — 4 hours" third layer of dead-client detection) is not referenced anywhere +— `SiteStreamGrpcServer.SubscribeInstance` creates a linked CTS from the call +cancellation token only, with no `CancelAfter`. The 4-hour zombie-stream safety net +described in the design doc does not exist in code. `GrpcMaxConcurrentStreams` is also +not wired to the server (`SiteStreamGrpcServer` takes a `maxConcurrentStreams` +constructor parameter defaulting to 100, but nothing binds the option to it). + +**Recommendation** + +Flow `CommunicationOptions` into `SiteStreamGrpcClient` and `SiteStreamGrpcServer` +(via the factory / DI). Apply `GrpcKeepAlivePingDelay` / `GrpcKeepAlivePingTimeout` to +the `SocketsHttpHandler`, bind `GrpcMaxConcurrentStreams` to the server's limit, and +implement the `GrpcMaxStreamLifetime` session timeout with `CancelAfter` on the +server-side stream CTS — or, if the 4-hour cap is intentionally dropped, remove the +option and update the design doc. + +**Resolution** + +_Unresolved._ + +### Communication-006 — Site address load failures are silently swallowed, leaving a stale cache + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:204` | + +**Description** + +`LoadSiteAddressesFromDb` runs the repository query inside `Task.Run(...).PipeTo(self)`. +If `GetAllSitesAsync` throws (database unavailable, transient connection error), the +faulted task is piped to `Self` as a `Status.Failure`. `CentralCommunicationActor` has +no `Receive` handler, so the failure becomes an unhandled message +(logged at debug, not surfaced) and the periodic refresh silently fails. If the +*first* startup load fails the actor runs with an empty `_siteClients` map — every +`SiteEnvelope` is dropped (line 187) and every Ask times out with no indication of the +root cause. + +**Recommendation** + +Add a `Receive` handler that logs the load failure at Warning/Error +level so operators can distinguish "site has no addresses configured" from "database +is down". Optionally surface a health metric for repeated load failures. + +**Resolution** + +_Unresolved._ + +### Communication-007 — `SiteStreamGrpcClientFactory.Dispose` blocks on async work (sync-over-async) + +| | | +|--|--| +| Severity | Medium | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:53` | + +**Description** + +`Dispose()` calls `DisposeAsync().AsTask().GetAwaiter().GetResult()`. This is the +classic sync-over-async pattern: it blocks the calling thread until all per-site +`SiteStreamGrpcClient.DisposeAsync` calls complete. If `Dispose` is invoked from a +context with a single-threaded synchronization context or from DI container shutdown +on a constrained thread pool, this can deadlock or stall host shutdown. The class +already implements `IAsyncDisposable`. + +**Recommendation** + +Prefer registering and disposing the factory through `IAsyncDisposable` only (modern +.NET DI honours it for singletons). If a synchronous `Dispose` must remain, dispose +the underlying `GrpcChannel`s directly (synchronous) rather than blocking on the async +path, or document why blocking is safe here. + +**Resolution** + +_Unresolved._ + +### Communication-008 — Reconnect retry-count reset can mask a flapping stream indefinitely + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:71`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:174` | + +**Description** + +`_retryCount` is reset to 0 every time a single `AttributeValueChanged` or +`AlarmStateChanged` event is received (lines 72, 77). Combined with `MaxRetries = 3`, +a stream that connects, delivers exactly one event, then fails — repeatedly — will +reconnect forever. The design doc states "max 3 retries, terminate the session if all +retries fail"; the current logic only terminates after 3 *consecutive* failures with +zero intervening events, so a flapping site never trips the limit and the debug +session (and its site-side relay) lives on indefinitely. The `ReceiveTimeout` orphan +net is also reset by every received message, so it does not bound this case either. + +**Recommendation** + +Either reset `_retryCount` only after the stream has been stably connected for some +minimum duration (e.g. a timer armed on stream open, cancelled on the next error), or +keep a separate cumulative reconnect counter / time window that bounds total +reconnects regardless of intervening events. + +**Resolution** + +_Unresolved._ + +### Communication-009 — `_siteClients` field is mutable and reassignable; cache update is not atomic on failure + +| | | +|--|--| +| Severity | Low | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:53`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:240` | + +**Description** + +`_siteClients` is a non-`readonly` `Dictionary` field. It is only mutated on the actor +thread (correct), but the field is needlessly reassignable, and +`HandleSiteAddressCacheLoaded` mutates it in place across several loops. If +`ActorPath.Parse` throws on a malformed address mid-loop (e.g. a site row with a +garbage `NodeAAddress`), the method aborts partway through, having already stopped +some ClusterClients and added others — leaving the cache partially updated with no +recovery until the next 60s refresh. The other actor mutable collections +(`_debugSubscriptions`, `_inProgressDeployments`) are correctly `readonly`. + +**Recommendation** + +Mark `_siteClients` `readonly`. Validate/parse all addresses up front (or wrap +`ActorPath.Parse` in a try/catch that logs and skips the bad site) so a single +malformed site record cannot abort the whole refresh and leave a half-updated cache. + +**Resolution** + +_Unresolved._ + +### Communication-010 — `DebugStreamBridgeActor` XML doc incorrectly describes it as a "Persistent actor" + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:10` | + +**Description** + +The class summary opens with "Persistent actor (one per active debug session)...". +The actor derives from `ReceiveActor`, not a persistent actor base class, holds no +`PersistenceId`, and writes no journal/snapshot. "Persistent" is misleading — debug +sessions are explicitly "session-based and temporary" per the design doc. A reader +could assume state survives restart, which it does not. + +**Recommendation** + +Reword the summary to "Long-lived (per active debug session) actor on the central +side..." or similar, removing the word "Persistent". + +**Resolution** + +_Unresolved._ + +### Communication-011 — No test coverage for snapshot-timeout cleanup, address-cache failure, or gRPC reconnect leak + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.Communication.Tests/` (module-wide) | + +**Description** + +The test suite covers happy-path routing, handler-not-registered failures, heartbeat +bumping, cache refresh, and gRPC bridge reconnect/retry. However several critical +paths identified in this review have no coverage: + +- The `DebugStreamService.StartStreamAsync` snapshot-timeout path (Communication-001) + — no test verifies bridge actor / site subscription teardown on timeout, nor the + `onTerminated`-before-snapshot race that throws a non-`OperationCanceledException`. +- `CentralCommunicationActor` behaviour when `LoadSiteAddressesFromDb` faults + (Communication-006) — `RefreshSiteAddresses_UpdatesCache` only exercises success. +- `SiteStreamGrpcClient` subscription-map overwrite/removal race (Communication-003) + and gRPC reconnect not unsubscribing the old node (Communication-002). +- A malformed `NodeAAddress` aborting `HandleSiteAddressCacheLoaded` (Communication-009). + +**Recommendation** + +Add tests for: snapshot timeout / pre-snapshot termination cleanup; address-load +failure logging and empty-cache behaviour; reusing a correlation ID across +`SubscribeAsync` calls; and a malformed site address during cache refresh. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/ConfigurationDatabase/findings.md b/code-reviews/ConfigurationDatabase/findings.md new file mode 100644 index 0000000..ec3ff38 --- /dev/null +++ b/code-reviews/ConfigurationDatabase/findings.md @@ -0,0 +1,394 @@ +# Code Review — ConfigurationDatabase + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.ConfigurationDatabase` | +| Design doc | `docs/requirements/Component-ConfigurationDatabase.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 11 | + +## Summary + +The ConfigurationDatabase module is a focused, conventional EF Core data-access layer: +a single `ScadaLinkDbContext`, Fluent API entity configurations, eight repository +implementations of Commons-defined interfaces, an `IAuditService` implementation, an +`IInstanceLocator`, environment-aware migration handling, and design-time tooling +support. Overall structure adheres well to the design doc and the CLAUDE.md "Code +Organization" decisions — POCO entities and interfaces live in Commons, EF mappings and +implementations live here, Fluent API only, and optimistic concurrency is correctly +applied to `DeploymentRecord` via `rowversion`. The module is generally healthy. + +The main themes across findings are: (1) a genuine logic bug in +`GetTemplateWithChildrenAsync`, which loads child templates and then discards them, so +the method does not deliver what its name implies; (2) secret-bearing columns (SMTP +credentials, external-system auth config, database connection strings) persisted in +plaintext with no encryption-at-rest; (3) a hardcoded SQL `sa` connection string with a +password literal embedded in `DesignTimeDbContextFactory`; (4) the no-arg +`AddConfigurationDatabase()` overload, which silently registers nothing, making a +misconfigured central node fail late and opaquely; and (5) audit-trail robustness gaps — +`AuditService` can throw on serializing entities with navigation cycles, rolling back +the whole business operation, and the design doc's claim that audit `Id` is `Long/GUID` +disagrees with the `int` entity. Test coverage is good for the repositories that have +tests (Security, CentralUI, audit, concurrency, seed data, data protection) but several +repositories (`TemplateEngineRepository`, `DeploymentManagerRepository`, +`ExternalSystemRepository`, `InboundApiRepository`, `NotificationRepository`, +`SiteRepository`, `InstanceLocator`) have little or no direct coverage. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ✓ | `GetTemplateWithChildrenAsync` discards loaded children (CD-001); `GetApprovedKeysForMethodAsync` CSV parsing is brittle (CD-008). | +| 2 | Akka.NET conventions | ✓ | No actors in this module; data-access layer only. No issues found. | +| 3 | Concurrency & thread safety | ✓ | DbContext correctly scoped; optimistic concurrency on `DeploymentRecord` correct. Repositories hold no shared mutable state. No issues found. | +| 4 | Error handling & resilience | ✓ | `WaitForDatabaseReadyAsync` is sound. No-arg DI overload fails late and silently (CD-003); audit JSON serialization failure handling (CD-007). | +| 5 | Security | ✓ | Hardcoded `sa` credential literal (CD-002); SMTP/DB-connection/auth secrets stored unencrypted (CD-004). | +| 6 | Performance & resource management | ✓ | `GetAllTemplatesAsync` / `GetTemplateTreeAsync` eager-load multiple collections without `AsSplitQuery` (CD-009). No N+1 in audited paths. | +| 7 | Design-document adherence | ✓ | Audit `Id` type mismatch vs design doc (CD-005); seed data uses `HasData` consistent with design. | +| 8 | Code organization & conventions | ✓ | Mostly clean. `Grpc*` address columns unbounded (CD-006); inconsistent null-guard on injected context (CD-011). | +| 9 | Testing coverage | ✓ | Several repositories and `InstanceLocator` lack direct tests (CD-010). | +| 10 | Documentation & comments | ✓ | `DeploymentManagerRepository` "WP-24 stub" XML comment is stale; noted in module context but not raised as a standalone finding. No issues found beyond items above. | + +## Findings + +### ConfigurationDatabase-001 — `GetTemplateWithChildrenAsync` loads child templates then discards them + +| | | +|--|--| +| Severity | High | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/TemplateEngineRepository.cs:30-41` | + +**Description** + +`GetTemplateWithChildrenAsync` queries for all templates whose `ParentTemplateId` +equals the requested id, assigns the result to the local variable `children`, and +then returns `template` — the `children` list is never used, attached to the returned +object, or otherwise exposed. The method is therefore behaviourally identical to +`GetTemplateByIdAsync` but issues an extra database round-trip. Any caller relying on +the method name to obtain a template with its derived/child templates populated will +silently receive a template with no children, leading to incorrect template-resolution +or UI behaviour with no error. + +**Recommendation** + +Either populate the children onto the returned aggregate (e.g. project into a result +type that carries the children, or load them into a navigation collection that is +actually returned), or remove the dead query and the misleading method if children are +not in fact needed. If the navigation does not exist on the `Template` entity, add an +explicit result tuple/DTO so the loaded data reaches the caller. + +**Resolution** + +_Unresolved._ + +### ConfigurationDatabase-002 — Hardcoded `sa` connection string with embedded password literal + +| | | +|--|--| +| Severity | Medium | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.ConfigurationDatabase/DesignTimeDbContextFactory.cs:21-22` | + +**Description** + +`DesignTimeDbContextFactory` falls back to a literal connection string +`"Server=localhost,1433;Database=ScadaLink_Config;User Id=sa;Password=YourPassword;TrustServerCertificate=True"` +when no configured connection string is found. Embedding a credential literal (even a +placeholder) in source code is a poor pattern: it is committed to version control, +encourages copy-paste of `sa`/`TrustServerCertificate=True` into real environments, and +the fallback can mask a genuine misconfiguration during `dotnet ef` operations by +silently pointing tooling at an unintended database. + +**Recommendation** + +Remove the hardcoded fallback. If no connection string is resolved from configuration +or environment, throw a clear `InvalidOperationException` instructing the developer to +set `ScadaLink:Database:ConfigurationDb` (or an environment variable). At minimum, read +the design-time connection string from an environment variable rather than a literal, +and never use `sa`. + +**Resolution** + +_Unresolved._ + +### ConfigurationDatabase-003 — No-arg `AddConfigurationDatabase()` silently registers nothing + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.ConfigurationDatabase/ServiceCollectionExtensions.cs:44-49` | + +**Description** + +The parameterless `AddConfigurationDatabase()` overload is a deliberate no-op "retained +for backward compatibility during migration." If a central node is wired up with this +overload by mistake, no `ScadaLinkDbContext`, repositories, `IAuditService`, or +`IInstanceLocator` are registered. The failure does not surface at startup; it surfaces +much later as opaque DI resolution exceptions the first time any consumer requests a +repository — far from the actual misconfiguration. The XML comment also refers to +"Phase 0 stubs," which is stale relative to the current state of the module. + +**Recommendation** + +Either delete the no-op overload now that the connection-string overload exists, or +mark it `[Obsolete]` with an error-level message so misuse is a compile-time failure. +If a true "site node" no-op is genuinely required, give it an explicit, self-documenting +name (e.g. `AddConfigurationDatabaseNoOp()`), and remove the stale "Phase 0" wording. + +**Resolution** + +_Unresolved._ + +### ConfigurationDatabase-004 — Secret-bearing columns stored in plaintext with no protection + +| | | +|--|--| +| Severity | Medium | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.ConfigurationDatabase/Configurations/NotificationConfiguration.cs:56-57`, `src/ScadaLink.ConfigurationDatabase/Configurations/ExternalSystemConfiguration.cs:25-26,75-77` | + +**Description** + +`SmtpConfiguration.Credentials`, `ExternalSystemDefinition.AuthConfiguration`, and +`DatabaseConnectionDefinition.ConnectionString` all hold authentication secrets (SMTP +OAuth2 client secrets / passwords, external-system API keys or Basic Auth credentials, +and database passwords respectively). They are mapped as ordinary string columns and +persisted verbatim. Anyone with read access to the configuration database — including +audit-log JSON if these entities are serialized into `AfterStateJson` — obtains the +plaintext secrets. The design doc does not call out encryption-at-rest for these +fields, so the design is also silent on a real risk. + +**Recommendation** + +Apply encryption to these fields, e.g. an EF Core value converter backed by ASP.NET +Data Protection (the module already configures `IDataProtectionKeyContext`), or rely on +SQL Server Always Encrypted / column encryption. Separately, ensure `IAuditService` +callers never pass these secret-bearing entities (or that the serializer redacts the +fields) so secrets do not leak into `AuditLogEntry.AfterStateJson`. Update the design +doc to state the chosen at-rest protection. + +**Resolution** + +_Unresolved._ + +### ConfigurationDatabase-005 — Audit `Id` type disagrees with the design doc + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.ConfigurationDatabase/Configurations/AuditConfiguration.cs:11` (entity `src/ScadaLink.Commons/Entities/Audit/AuditLogEntry.cs`) | + +**Description** + +The design doc's Audit Entry Schema table specifies `Id` as `Long / GUID`, and notes +the audit table is append-only and retained indefinitely. The actual `AuditLogEntry` +entity uses an `int` identity key. For a never-purged, append-only table that +accumulates one row per save operation across the system lifetime, a 32-bit identity +risks overflow over a long deployment horizon, and the code drifts from the documented +schema. + +**Recommendation** + +Change `AuditLogEntry.Id` to `long` (and the corresponding migration column to +`bigint`) to match the design doc and remove the overflow risk, or — if `int` is +intentional — update the design doc's schema table to say `int` and justify it. +Resolve the discrepancy in one direction. + +**Resolution** + +_Unresolved._ + +### ConfigurationDatabase-006 — `Site.GrpcNodeAAddress` / `GrpcNodeBAddress` columns are unbounded + +| | | +|--|--| +| Severity | Low | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.ConfigurationDatabase/Configurations/SiteConfiguration.cs:24-25` | + +**Description** + +`SiteConfiguration` explicitly sets `HasMaxLength(500)` for `NodeAAddress` and +`NodeBAddress`, but the entity also has `GrpcNodeAAddress` and `GrpcNodeBAddress` +(added per the gRPC streaming design decision) which are not configured at all. With no +length set, EF Core maps them to `nvarchar(max)`. This is inconsistent with the sibling +address columns, wastes the opportunity to constrain input, and `nvarchar(max)` columns +cannot be indexed and have different storage/performance characteristics. + +**Recommendation** + +Add `builder.Property(s => s.GrpcNodeAAddress).HasMaxLength(500);` and the same for +`GrpcNodeBAddress`, matching the existing `NodeAAddress`/`NodeBAddress` mapping, and +generate a migration to alter the column types. + +**Resolution** + +_Unresolved._ + +### ConfigurationDatabase-007 — `AuditService` does not handle JSON-serialization failure of arbitrary `afterState` + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.ConfigurationDatabase/Services/AuditService.cs:28-30` | + +**Description** + +`LogAsync` serializes the caller-supplied `afterState` object with +`JsonSerializer.Serialize(afterState)` using default options. EF entity POCOs commonly +have navigation properties; serializing an entity that has loaded navigations (e.g. a +`Template` with `Attributes`/`Scripts`, or any entity with a cycle) will throw +`JsonException` for a reference cycle or produce a very large payload. Because audit +writes are designed to commit in the same transaction as the change, a serialization +exception thrown here will roll back the *entire* business operation — a template +update fails because its audit entry could not be serialized. This couples audit +robustness to the shape of every entity passed in. + +**Recommendation** + +Configure `JsonSerializerOptions` with `ReferenceHandler.IgnoreCycles` (or +`Preserve`) and a sensible `MaxDepth`, and consider serializing a projected +DTO/snapshot rather than the live tracked entity. Decide explicitly whether an audit +serialization failure should fail the operation or be logged and degraded gracefully, +and document that decision against the design doc's transactional-guarantee section. + +**Resolution** + +_Unresolved._ + +### ConfigurationDatabase-008 — `GetApprovedKeysForMethodAsync` CSV parsing silently drops malformed ids + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/InboundApiRepository.cs:46-58` | + +**Description** + +`ApiMethod.ApprovedApiKeyIds` is stored as a comma-separated string of integer ids. +`GetApprovedKeysForMethodAsync` splits it, maps each token with +`int.TryParse(...) ? id : -1`, then filters with `id > 0`. Any token that fails to +parse, or a legitimately negative/zero id, is silently discarded. If `ApprovedApiKeyIds` +becomes corrupt (e.g. a stray name instead of an id), the method quietly returns fewer +approved keys than expected, which for an API-key authorization path means a method may +unexpectedly reject a key that should be approved. Storing a relational many-to-many as +a CSV string in a column is itself fragile (no FK integrity, no cascade on key delete). + +**Recommendation** + +Short term: log a warning when a token fails to parse instead of silently dropping it, +so corruption is observable. Longer term: replace the CSV column with a proper join +table (`ApiMethodApprovedKey`) with foreign keys to `ApiMethod` and `ApiKey`, which +gives referential integrity and correct cascade behaviour when an API key is deleted. + +**Resolution** + +_Unresolved._ + +### ConfigurationDatabase-009 — Multi-collection eager loads issue cartesian-product queries + +| | | +|--|--| +| Severity | Low | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/TemplateEngineRepository.cs:43-51,53-61`, `src/ScadaLink.ConfigurationDatabase/Repositories/CentralUiRepository.cs:45-55` | + +**Description** + +`GetAllTemplatesAsync`, `GetTemplatesComposingAsync`, and `GetTemplateTreeAsync` each +`Include` three-to-four sibling collections (`Attributes`, `Alarms`, `Scripts`, +`Compositions`) in a single query. EF Core's default single-query strategy produces a +cartesian-product join across those collections, so a template with N attributes, M +alarms, and K scripts yields N×M×K rows that EF must then de-duplicate. For templates +with many members this materially inflates the result set and query time. +`GetInstanceByIdAsync`/`GetAllInstancesAsync` have the same shape with three +collections. + +**Recommendation** + +Add `.AsSplitQuery()` to these multi-collection-include queries (or set +`UseQuerySplittingBehavior(QuerySplittingBehavior.SplitQuery)` globally in +`AddConfigurationDatabase`) so each collection is loaded with a separate query and the +cartesian explosion is avoided. + +**Resolution** + +_Unresolved._ + +### ConfigurationDatabase-010 — Several repositories and `InstanceLocator` lack direct test coverage + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Open | +| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/TemplateEngineRepository.cs`, `Repositories/DeploymentManagerRepository.cs`, `Repositories/ExternalSystemRepository.cs`, `Repositories/InboundApiRepository.cs`, `Repositories/NotificationRepository.cs`, `Repositories/SiteRepository.cs`, `Services/InstanceLocator.cs` | + +**Description** + +The test project covers `SecurityRepository`, `CentralUiRepository`, `AuditService`, +optimistic concurrency, seed data, and Data Protection persistence. There are no direct +tests for `TemplateEngineRepository` (the largest repository, and the one with the +CD-001 bug, which a test would have caught), `DeploymentManagerRepository` (including +its `Local`-then-stub delete fallback and the `DeleteInstanceAsync` +restrict-FK-cleanup logic), `ExternalSystemRepository`, `InboundApiRepository` (notably +`GetApprovedKeysForMethodAsync` CSV parsing — CD-008), `NotificationRepository`, +`SiteRepository` (including its stub-attach delete path), or `InstanceLocator`. + +**Recommendation** + +Add repository-level tests using the existing `SqliteTestHelper` pattern, covering at +minimum: CRUD round-trips, the stub-attach delete fallbacks in +`DeploymentManagerRepository`/`SiteRepository`, `DeleteInstanceAsync`'s explicit +deployment-record cleanup, `GetApprovedKeysForMethodAsync` with valid/malformed CSV, +and `InstanceLocator.GetSiteIdForInstanceAsync` for found/not-found cases. + +**Resolution** + +_Unresolved._ + +### ConfigurationDatabase-011 — Inconsistent constructor null-guarding across repositories/services + +| | | +|--|--| +| Severity | Low | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/ExternalSystemRepository.cs:11-14`, `Repositories/InboundApiRepository.cs:11-14`, `Repositories/NotificationRepository.cs:11-14`, `Services/InstanceLocator.cs:13-16` | + +**Description** + +`SecurityRepository`, `CentralUiRepository`, `TemplateEngineRepository`, +`DeploymentManagerRepository`, `SiteRepository`, and `AuditService` all guard their +injected `ScadaLinkDbContext` with `?? throw new ArgumentNullException(...)`. +`ExternalSystemRepository`, `InboundApiRepository`, `NotificationRepository`, and +`InstanceLocator` assign the constructor argument directly with no guard. This is a +minor consistency/maintainability issue: although the DI container will not normally +supply null, the divergence makes the codebase look unfinished and means a future +hand-constructed instance fails with a less informative `NullReferenceException` later. + +**Recommendation** + +Apply the same `?? throw new ArgumentNullException(nameof(context))` guard in the four +inconsistent constructors so all data-access types behave uniformly. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/DataConnectionLayer/findings.md b/code-reviews/DataConnectionLayer/findings.md new file mode 100644 index 0000000..95aeba2 --- /dev/null +++ b/code-reviews/DataConnectionLayer/findings.md @@ -0,0 +1,471 @@ +# Code Review — DataConnectionLayer + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.DataConnectionLayer` | +| Design doc | `docs/requirements/Component-DataConnectionLayer.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 13 | + +## Summary + +The DataConnectionLayer is a reasonably well-structured module: the Become/Stash +lifecycle state machine, the captured-`Self` marshalling of background-thread +disconnect events, and the protocol-factory abstraction all follow the design doc +and Akka.NET conventions. However, the review found one **critical** actor-model +violation — `HandleSubscribe` spawns a `Task.Run` that mutates the actor's private +dictionaries and counters from a thread-pool thread, racing with the actor's own +message loop. Several **high**-severity issues cluster around concurrency and error +handling: the subscription-failure path leaves the connection with degraded subtrees +but no real recovery, the `DataConnectionManagerActor`'s `Restart` supervision drops +all subscription state on a connection-actor crash, and `RealOpcUaClient`'s monitored- +item callback dictionary is mutated without synchronization while OPC UA notification +threads read it. The remaining findings concern stale health counters after failover, +an unused `WriteTimeout` option (writes are unbounded despite the design promising a +30 s timeout), `ReadBatchAsync` aborting mid-batch, and documentation drift between +the design doc's failover state machine and the implemented unstable-disconnect +heuristic. Test coverage is adequate for the happy paths and failover but absent for +tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | x | `_resolvedTags` double-counting and stale counters after failover; `ReadBatchAsync` aborts mid-batch. | +| 2 | Akka.NET conventions | x | `Task.Run` mutating actor state (critical); `Restart` supervision loses state; closures capturing `_subscriptionsByInstance`. | +| 3 | Concurrency & thread safety | x | Actor state mutated off the actor thread; `RealOpcUaClient` callback dictionary unsynchronized. | +| 4 | Error handling & resilience | x | Subscription failures not surfaced; unbounded write with no timeout; reconnect after subscribe-time failure not handled. | +| 5 | Security | x | `AutoAcceptUntrustedCerts` defaults to `true`; OPC UA password handling acceptable. See finding 012. | +| 6 | Performance & resource management | x | `HandleUnsubscribe` O(n^2) over instances; initial-read loop serial per tag. | +| 7 | Design-document adherence | x | Failover heuristic (unstable-disconnect count) differs from documented state machine; `WriteTimeout` documented but unused. | +| 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. | +| 9 | Testing coverage | x | No tests for tag-resolution retry, disconnect/re-subscribe, bad-quality push, or `HandleSubscribe` concurrency. | +| 10 | Documentation & comments | x | XML comment on `RaiseDisconnected` claims thread safety it does not have; design doc round-robin description stale. | + +## Findings + +### DataConnectionLayer-001 — `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread + +| | | +|--|--| +| Severity | Critical | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:473-538` | + +**Description** + +`HandleSubscribe` launches a `Task.Run(async () => ...)` that runs on a thread-pool +thread and directly mutates the actor's private mutable state: `instanceTags` (a +reference into `_subscriptionsByInstance`), `_subscriptionIds`, `_totalSubscribed`, +`_resolvedTags`, and `_unresolvedTags`. All of these are simultaneously read and +written by the actor's own message loop (`HandleTagValueReceived`, `HandleUnsubscribe`, +`ReSubscribeAll`, `HandleRetryTagResolution`, `ReplyWithHealthReport`). This is a +direct violation of the Akka.NET actor model, which guarantees single-threaded access +to actor state only when state is touched on the actor thread. Two concurrent +subscribe requests, or a subscribe overlapping a `TagValueReceived` / `GetHealthReport`, +produce data races on `Dictionary`/`HashSet`/`int` — `Dictionary` is not thread-safe +and concurrent mutation can corrupt internal buckets, throw, or lose entries. It can +also produce torn reads of the health counters. + +**Recommendation** + +Do not mutate actor state from the background task. Perform only the `await +_adapter.SubscribeAsync(...)` / `ReadAsync(...)` I/O in the task, collect the results +into a local immutable result object, and `PipeTo(Self)` an internal message (e.g. +`SubscribeCompleted`) whose handler — running on the actor thread — applies all state +mutations and counter updates. The response to `Sender` should be sent from that +handler too. + +**Resolution** + +_Unresolved._ + +### DataConnectionLayer-002 — `Restart` supervision discards all subscription state on connection-actor crash + +| | | +|--|--| +| Severity | High | +| Category | Akka.NET conventions | +| Status | Open | +| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionManagerActor.cs:131-141` | + +**Description** + +`DataConnectionManagerActor.SupervisorStrategy` returns a `OneForOneStrategy` with +`Directive.Restart` for `DataConnectionActor` failures. On restart, Akka.NET creates a +fresh actor instance, so all in-memory fields — `_subscriptionsByInstance`, +`_subscriptionIds`, `_subscribers`, `_unresolvedTags`, the quality counters — are +silently discarded. The actor re-enters `Connecting` with zero subscriptions, and the +design doc's "transparent re-subscribe" guarantee (WP-10) is broken: Instance Actors +that had subscribed before the crash never get their tags re-subscribed and will sit +at uncertain/stale quality indefinitely with no error returned. There is no durable +subscription store from which a restarted actor could rebuild state. + +**Recommendation** + +Either (a) make the subscription registry durable/recoverable so a restarted actor +can rebuild it (persist to local SQLite as the design doc says connection definitions +are, and have `PreStart` reload subscriptions), or (b) treat a connection-actor crash +as a lifecycle event the `DataConnectionManagerActor` notices, so it can re-issue the +subscription registrations. At minimum document that subscribers must re-register +after a crash and surface the lost-state condition rather than failing silently. + +**Resolution** + +_Unresolved._ + +### DataConnectionLayer-003 — `RealOpcUaClient` callback/monitored-item dictionaries mutated without synchronization + +| | | +|--|--| +| Severity | High | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:16-17,130-131,153,163,173,183-184` | + +**Description** + +`_monitoredItems` and `_callbacks` are plain `Dictionary<,>` instances. They are +written from `CreateSubscriptionAsync` / `RemoveSubscriptionAsync` (invoked from the +`DataConnectionActor`'s `Task.Run` / `ContinueWith` continuations, i.e. thread-pool +threads) and from `DisconnectAsync` (`.Clear()`), while being read concurrently from +the OPC Foundation SDK's `MonitoredItem.Notification` event handler, which fires on +the SDK's internal publish threads (`_callbacks.TryGetValue(handle, ...)` at line +163). Concurrent reads during a `Dictionary` resize or `Clear()` are undefined +behaviour — they can throw `InvalidOperationException`, return wrong entries, or +corrupt the dictionary. The `DataConnectionActor`'s subscribe path already runs off +the actor thread (finding 001), so multiple subscribe calls can also race each other +here. + +**Recommendation** + +Use `ConcurrentDictionary<,>` for `_monitoredItems` and `_callbacks`, or guard all +access with a lock. Note that fixing finding 001 (serialising subscribe through the +actor thread) reduces but does not eliminate the race, because the SDK notification +threads still read `_callbacks` concurrently with `RemoveSubscriptionAsync` / +`DisconnectAsync`. + +**Resolution** + +_Unresolved._ + +### DataConnectionLayer-004 — Subscribe-time tag-resolution failure leaves the connection healthy but never recovers correctly + +| | | +|--|--| +| Severity | High | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:495-503,529-537` | + +**Description** + +When `_adapter.SubscribeAsync` throws inside the `HandleSubscribe` background task, +the catch block adds the tag to `_unresolvedTags` and increments `_totalSubscribed`, +treating every subscribe exception as a tag-resolution failure. But `SubscribeAsync` +also throws `InvalidOperationException` from `EnsureConnected()` when the OPC UA +client is not connected, and throws on transport faults — these are connection +problems, not bad tag paths. They get misclassified as unresolved tags and retried on +the 10 s tag-resolution timer instead of triggering the reconnection state machine. +Worse, the design doc (Tag Path Resolution, step 2) says the failed tag's attribute +must be marked quality `bad`; the code never pushes a bad-quality update to the +subscriber for a tag that fails to resolve at subscribe time, so the Instance Actor +stays at uncertain quality with no signal. The `TagResolutionFailed` message it sends +to `Self` only logs and re-arms the timer (`HandleTagResolutionFailed`). + +**Recommendation** + +Distinguish connection-level exceptions (raise `AdapterDisconnected` / let the +reconnect machine handle them) from genuine node-not-found errors. For genuine +resolution failures, push a `TagValueUpdate` with `QualityCode.Bad` to the subscribing +Instance Actor so it reflects the documented behaviour. + +**Resolution** + +_Unresolved._ + +### DataConnectionLayer-005 — `WriteTimeout` option is documented and configured but never applied + +| | | +|--|--| +| Severity | High | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.DataConnectionLayer/DataConnectionOptions.cs:15`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:573-590` | + +**Description** + +`DataConnectionOptions.WriteTimeout` (default 30 s) and the design doc's "Shared +Settings" table both promise a bounded timeout for synchronous device writes. The +value is never read anywhere in the module (`grep` confirms only the declaration). +`HandleWrite` calls `_adapter.WriteAsync(request.TagPath, request.Value)` with no +`CancellationToken` and no timeout. If the OPC UA server hangs (TCP black-hole, no +RST), the write `Task` never completes, `PipeTo(sender)` never fires, and the calling +script's Ask blocks until its own ask-timeout — and the script gets no DCL-level +error. The design states write failures (including timeout) must be returned +synchronously to the script; an unbounded write violates that. + +**Recommendation** + +Create a `CancellationTokenSource(_options.WriteTimeout)`, pass its token to +`WriteAsync`, and in the continuation translate cancellation into a failed +`WriteTagResponse` with a timeout error message. Apply the same to the read used by +the initial-value seed and to `WriteBatchAndWaitAsync` paths if they are reachable. + +**Resolution** + +_Unresolved._ + +### DataConnectionLayer-006 — Health quality counters not reset/recomputed after failover or re-subscribe + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:645-673,721-756` | + +**Description** + +`ReSubscribeAll` resets `_subscriptionIds`, `_unresolvedTags` and `_resolvedTags` to a +clean slate, but leaves `_lastTagQuality`, `_tagsGoodQuality`, `_tagsBadQuality` and +`_tagsUncertainQuality` untouched. `PushBadQualityForAllTags` (called on disconnect) +sets `_tagsBadQuality = _lastTagQuality.Count` and zeroes the others. After a +reconnect, `HandleTagValueReceived` decrements the *old* bucket using +`_lastTagQuality`'s value and increments the new one — but tags resolved for the first +time after reconnect were never in `_lastTagQuality`, so they only increment, never +decrement, and the totals can drift above `_totalSubscribed`. Over repeated +disconnect/reconnect cycles the health report's good/bad/uncertain counts become +unreliable. + +**Recommendation** + +On `BecomeConnected` after a re-subscribe (or in `ReSubscribeAll`), clear +`_lastTagQuality` and the three quality counters and let them be repopulated from +fresh `TagValueReceived` messages. Alternatively recompute the buckets from +`_lastTagQuality` whenever it changes rather than maintaining incremental counters. + +**Resolution** + +_Unresolved._ + +### DataConnectionLayer-007 — `ReadBatchAsync` aborts the whole batch on the first failing tag + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:187-195` | + +**Description** + +`ReadBatchAsync` loops calling `ReadAsync` per tag. `ReadAsync` re-throws any +non-cancellation exception (line 184). So if any single tag in the batch throws (bad +node, transient fault), the entire `ReadBatchAsync` throws and the caller gets no +results for the tags that *did* read successfully — even though `ReadResult` already +has a `Success`/`ErrorMessage` shape designed to carry per-tag failures. The batch is +also fully serial (one round-trip per tag), defeating the point of a batch API; the +design doc lists `ReadBatch`/`WriteBatch` as first-class operations. + +**Recommendation** + +Catch per-tag exceptions inside the loop and store a failed `ReadResult` for that tag +so the batch returns a complete map. Ideally issue a single OPC UA `Read` service call +for all node IDs (`RealOpcUaClient.ReadValueAsync` already builds a +`ReadValueIdCollection` — extend it to accept multiple nodes). + +**Resolution** + +_Unresolved._ + +### DataConnectionLayer-008 — `HandleUnsubscribe` is O(n^2) over instances and rechecks `_unresolvedTags` redundantly + +| | | +|--|--| +| Severity | Low | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:540-569` | + +**Description** + +For each tag of the instance being removed, `HandleUnsubscribe` scans every other +instance's tag set (`_subscriptionsByInstance.Where(...).Any()`), making the operation +O(tags x instances). On a site with many instances sharing a connection this is +needlessly expensive on every instance stop/redeploy. Separately, line 562 +re-evaluates `!_unresolvedTags.Contains(tagPath)` immediately after line 561 already +removed `tagPath` from `_unresolvedTags`, so the condition is always true — dead +logic that obscures intent (the decrement of `_resolvedTags` is unconditional in +practice). + +**Recommendation** + +Maintain a reference count per tag path (or a `tagPath -> set` reverse index) +so the "any other subscriber" check is O(1). Remove the redundant `_unresolvedTags` +re-check or restructure so the resolved/unresolved decrement reflects the tag's actual +prior state captured before removal. + +**Resolution** + +_Unresolved._ + +### DataConnectionLayer-009 — Implemented failover heuristic diverges from the documented state machine + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:189,242-297,379-449`, `docs/requirements/Component-DataConnectionLayer.md:73-85` | + +**Description** + +The design doc's failover state machine reads "retry active endpoint (5s) -> N failures +(>= FailoverRetryCount) -> switch to other endpoint". The code implements two *separate* +failover triggers: (a) `HandleReconnectResult` counts `_consecutiveFailures` on +connect-attempt failures (matches the doc), and (b) `BecomeReconnecting` additionally +counts `_consecutiveUnstableDisconnects` — connections that succeeded but dropped +within a hard-coded 60 s `StableConnectionThreshold` — and fails over on that count +too. The unstable-disconnect path, the 60 s threshold, and the fact that failover can +happen on *successful-but-flaky* connections are not described in the component doc at +all. A reviewer or operator reading `Component-DataConnectionLayer.md` would not +predict this behaviour, and the 60 s threshold is a magic constant not exposed via +`DataConnectionOptions`. + +**Recommendation** + +Update `Component-DataConnectionLayer.md` to document the unstable-disconnect failover +path and the stability threshold, and move the 60 s threshold into +`DataConnectionOptions` so it is configurable and consistent with the other tunables. + +**Resolution** + +_Unresolved._ + +### DataConnectionLayer-010 — Tag-resolution retry can issue duplicate concurrent subscribe attempts + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:594-619,689-703` | + +**Description** + +`HandleRetryTagResolution` fires `SubscribeAsync` for every tag in `_unresolvedTags` +via `ContinueWith(...).PipeTo(self)`, but does **not** remove the tags from +`_unresolvedTags` while the attempts are in flight. Because tags are not removed +before the retry, a slow `SubscribeAsync` overlapping the next 10 s tick issues +duplicate concurrent subscribe attempts for the same tag, which can create duplicate +monitored items / leaked subscription IDs (the second success overwrites +`_subscriptionIds[tag]` in `HandleTagResolutionSucceeded`, orphaning the first handle +with no `UnsubscribeAsync` call). The timer-cancel condition in +`HandleTagResolutionSucceeded` is also non-deterministic for the same reason. + +**Recommendation** + +Remove tags from `_unresolvedTags` (into an "in-flight" set) when a retry is +dispatched, and only put them back on failure. This prevents overlapping duplicate +subscribe attempts and makes the timer-cancel condition deterministic. + +**Resolution** + +_Unresolved._ + +### DataConnectionLayer-011 — Stale subscription callbacks from disposed adapters can still reach the actor + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:486-489,278-285,416-425`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:252-262` | + +**Description** + +On failover the actor disposes the old adapter (`_adapter.DisposeAsync()`, +fire-and-forget) and creates a fresh one. The old adapter's subscription callbacks +captured `self` and `tagPath` and `Tell` `TagValueReceived` to the actor. While the +`Reconnecting` handler ignores `TagValueReceived` (line 334), once the actor reaches +`Connected` again it processes them — and a disposed adapter whose OPC UA SDK threads +have not yet fully torn down could still deliver a value, mixing pre-failover device +data with the new endpoint's data and briefly reporting a value the active endpoint +never produced. There is no per-adapter generation/epoch tag on `TagValueReceived` to +distinguish current from stale callbacks. + +**Recommendation** + +Add an adapter-generation counter incremented on every adapter swap; stamp it onto +`TagValueReceived` (captured in the callback closure) and drop messages whose +generation does not match the current adapter in `HandleTagValueReceived`. + +**Resolution** + +_Unresolved._ + +### DataConnectionLayer-012 — `AutoAcceptUntrustedCerts` defaults to `true`, accepting any server certificate + +| | | +|--|--| +| Severity | Medium | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.DataConnectionLayer/Adapters/IOpcUaClient.cs:17`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:49,60-61`, `docs/requirements/Component-DataConnectionLayer.md:116` | + +**Description** + +`OpcUaConnectionOptions.AutoAcceptUntrustedCerts` defaults to `true`, and +`RealOpcUaClient.ConnectAsync` wires `CertificateValidator.CertificateValidation += (_, e) => e.Accept = true` +when it is set. With the default, every server certificate is accepted unconditionally +— there is no certificate-pinning or trust-store enforcement — which defeats the +`Sign`/`SignAndEncrypt` security modes against an active man-in-the-middle on the OPC +UA link. The design doc explicitly lists `true` as the default. For an industrial +control link this is a meaningful exposure; a secure-by-default posture would reject +untrusted certs unless an operator opts in per connection. + +**Recommendation** + +Default `AutoAcceptUntrustedCerts` to `false` and require explicit per-connection +opt-in, or at minimum log a prominent warning whenever the auto-accept validator is +installed. Update the design doc to reflect the secure default. + +**Resolution** + +_Unresolved._ + +### DataConnectionLayer-013 — Misleading XML comment: `RaiseDisconnected` claims thread safety it does not provide + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:270-281` | + +**Description** + +The XML doc on `RaiseDisconnected` states "Thread-safe: only the first caller triggers +the event." The implementation is a non-atomic check-then-set on a `volatile bool` +(`if (_disconnectFired) return; _disconnectFired = true;`). `volatile` guarantees +visibility, not atomicity — two threads (e.g. the OPC UA keep-alive thread via +`OnClientConnectionLost` and a `ReadAsync` failure path) can both observe +`_disconnectFired == false` and both invoke `Disconnected`. In practice the +`DataConnectionActor` tolerates a duplicate `AdapterDisconnected` message, so impact +is low, but the comment overstates the guarantee. The same pattern exists in +`RealOpcUaClient.OnSessionKeepAlive` (`_connectionLostFired`). + +**Recommendation** + +Either make the guard atomic (`Interlocked.Exchange` with an `int` flag, or a lock), +or correct the comment to say "best-effort once-only; a duplicate event is possible +under a race and is tolerated downstream." + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/DeploymentManager/findings.md b/code-reviews/DeploymentManager/findings.md new file mode 100644 index 0000000..34fc601 --- /dev/null +++ b/code-reviews/DeploymentManager/findings.md @@ -0,0 +1,493 @@ +# Code Review — DeploymentManager + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.DeploymentManager` | +| Design doc | `docs/requirements/Component-DeploymentManager.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 14 | + +## Summary + +The DeploymentManager module is small, well-structured, and clearly maps work +packages (WP-N) onto code. The happy paths for instance deployment, lifecycle +commands, artifact broadcast, and staleness comparison are implemented +sensibly, and the operation lock correctly serializes mutating operations per +instance while allowing cross-instance parallelism. However, the review found a +significant cluster of error-handling and resilience gaps: the deployment +record can be left permanently stuck in `InProgress` when an exception other +than timeout/cancellation is thrown, the catch block writes its failure status +using a cancellation token that may already be cancelled, and the +`OperationLockManager` leaks one `SemaphoreSlim` per instance name forever. +There are also two notable design-document adherence gaps: the +"query-the-site-before-redeploy" idempotency requirement is not implemented +(`GetDeploymentStatusAsync` only reads the local DB), and the "Diff View" +feature is reduced to a bare hash comparison with no added/removed/changed +detail. Configuration is not bound to `appsettings.json`, leaving one option +entirely dead. Test coverage stops at the communication boundary and never +exercises a successful deployment or the lifecycle success paths. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ✓ | Stuck `InProgress` record on unexpected exception; cancelled-token failure write. | +| 2 | Akka.NET conventions | ✓ | Module is a plain service layer; it calls `CommunicationService` which wraps Ask. No actors here. No issues. | +| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` is sound but leaks semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. | +| 4 | Error handling & resilience | ✓ | Several gaps — see DeploymentManager-001/002/003/004. | +| 5 | Security | ✓ | SMTP credentials are serialized and broadcast to sites — see DeploymentManager-013. No injection vectors; no authz here (enforced upstream). | +| 6 | Performance & resource management | ✓ | Semaphore leak (DeploymentManager-005); artifact rebuild does N+1 method queries per external system. | +| 7 | Design-document adherence | ✓ | Missing query-before-redeploy (DeploymentManager-006); Diff View not implemented (DeploymentManager-007). | +| 8 | Code organization & conventions | ✓ | Options class not bound to configuration — DeploymentManager-008. POCO/repo placement correct. | +| 9 | Testing coverage | ✓ | No successful-deploy test, no lifecycle success test — DeploymentManager-011; dead `CreateCommand` helper — DeploymentManager-014. | +| 10 | Documentation & comments | ✓ | Misleading timeout comment — DeploymentManager-009; stale option XML doc — DeploymentManager-012. | + +## Findings + +### DeploymentManager-001 — Unexpected exceptions leave the deployment record stuck in `InProgress` + +| | | +|--|--| +| Severity | High | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:141-199` | + +**Description** + +`DeployInstanceAsync` sets the record to `InProgress` (lines 137-139), then the +`try` block calls into `CommunicationService` and the repository. The only +`catch` filter is `when (ex is TimeoutException or OperationCanceledException)`. +Any other exception — `InvalidOperationException` (thrown by +`CommunicationService.GetCommunicationActor()` when the actor is not set), a +JSON serialization error, a deserialization failure of the response, a DB +exception on `UpdateDeploymentRecordAsync`, or any transport error — escapes the +method. The deployment record remains in `DeploymentStatus.InProgress` +permanently. Because staleness and the UI both read current status, the +instance is then misreported as "deploying" forever and a re-deploy may be +blocked or misinterpreted. The design explicitly states an interrupted +deployment must be "treated as failed". + +**Recommendation** + +Broaden the catch to a general `catch (Exception ex)` that records +`DeploymentStatus.Failed` with the error message, audit-logs the failure, and +re-throws or returns a failed `Result`. Keep the timeout-specific branch only +if a distinct message is desired. Ensure the failure-status write happens for +every exit path out of the `try`. + +**Resolution** + +_Unresolved._ + +### DeploymentManager-002 — Failure-status write uses a possibly-cancelled cancellation token + +| | | +|--|--| +| Severity | High | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:186-196` | + +**Description** + +The `catch (Exception ex) when (ex is TimeoutException or +OperationCanceledException)` block updates the record to `Failed` and calls +`UpdateDeploymentRecordAsync`/`SaveChangesAsync`/`LogAsync` passing the same +`cancellationToken` that was just cancelled (an `OperationCanceledException` +caught here means the token is already in the cancelled state). Those +repository and audit calls will themselves throw `OperationCanceledException` +before the failure status is persisted, so the record stays `InProgress` — the +exact bug DeploymentManager-001 describes, reached via the supposedly-handled +path. + +**Recommendation** + +Perform the cleanup writes with a fresh, non-cancellable token (e.g. +`CancellationToken.None`, optionally with an independent short timeout) so the +failure status is durably recorded even when the original operation was +cancelled or timed out. + +**Resolution** + +_Unresolved._ + +### DeploymentManager-003 — Successful-deployment cleanup is not atomic with the status write + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:155-170` | + +**Description** + +After a successful site response the code calls `UpdateDeploymentRecordAsync` +(no `SaveChanges` yet), then `UpdateInstanceAsync`, then +`StoreDeployedSnapshotAsync` (which itself issues `Add`/`Update` calls), then a +single `SaveChangesAsync` at line 170. If `StoreDeployedSnapshotAsync` throws, +the exception is not caught (see DeploymentManager-001) and the +`SaveChangesAsync` never runs — the instance state, deployment status, and +snapshot are all left unpersisted even though the site has actually applied the +deployment. Central and site are now divergent: the site is running the new +config but central still shows the old state and a non-`Success` deployment +record. + +**Recommendation** + +Wrap the post-success persistence so that, at minimum, the deployment record's +`Success` status is committed. Consider committing the status first, then the +instance state and snapshot, so a later failure does not lose the fact that the +site succeeded. Log loudly if the snapshot write fails after a confirmed site +apply. + +**Resolution** + +_Unresolved._ + +### DeploymentManager-004 — Site-success but central-delete-failure leaves orphaned site config + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:312-319` | + +**Description** + +In `DeleteInstanceAsync`, when the site responds `Success` the code calls +`_repository.DeleteInstanceAsync` then `SaveChangesAsync`. If `SaveChangesAsync` +throws (DB error, concurrency), the exception propagates uncaught: the site has +already destroyed the Instance Actor and removed its config, but the central +instance record still exists. The instance is now un-deletable through the +normal path (the site no longer has it, so a re-issued delete may fail) and is +permanently orphaned. The design states central must not mark the instance +deleted until the site confirms — but it does not address the inverse failure. + +**Recommendation** + +Catch persistence failures in the post-success block and surface a distinct +error indicating the site succeeded but the central record could not be +removed, so an operator/retry can reconcile. Consider making the central delete +idempotent and retryable independently of the site command. + +**Resolution** + +_Unresolved._ + +### DeploymentManager-005 — `OperationLockManager` leaks a `SemaphoreSlim` per instance name + +| | | +|--|--| +| Severity | Medium | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.DeploymentManager/OperationLockManager.cs:15-33` | + +**Description** + +`AcquireAsync` does `_locks.GetOrAdd(instanceUniqueName, _ => new +SemaphoreSlim(1, 1))` and entries are never removed. Every distinct instance +unique name that is ever deployed/disabled/enabled/deleted permanently adds a +`SemaphoreSlim` (an `IDisposable` holding a kernel wait handle) to the +dictionary. Over the lifetime of a long-running central process — especially +with the bulk "deploy all out-of-date instances" workflow and instances that +are created and deleted over time — this is an unbounded leak of both managed +memory and OS handles. Deleted instances' semaphores are never reclaimed. + +**Recommendation** + +Either accept the leak explicitly and document the expected bounded cardinality +of instance names, or implement reclamation: e.g. ref-count handles and remove ++ `Dispose()` the semaphore when the count reaches zero and the lock is free. +At minimum, remove the semaphore entry when an instance is deleted +(`DeleteInstanceAsync`). + +**Resolution** + +_Unresolved._ + +### DeploymentManager-006 — Query-the-site-before-redeploy idempotency requirement not implemented + +| | | +|--|--| +| Severity | High | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:84-200,363-368` | + +**Description** + +The design ("Deployment Identity & Idempotency") requires: "After a central +failover or timeout, the Deployment Manager queries the site for current +deployment state before allowing a re-deploy. This prevents duplicate +application and out-of-order config changes." The code never does this. +`GetDeploymentStatusAsync` only reads the local `DeploymentRecord` from the DB +(`GetDeploymentByDeploymentIdAsync`) — it does not contact the site. +`DeployInstanceAsync` unconditionally generates a new deployment ID and sends a +new `DeployInstanceCommand` regardless of any prior in-flight or timed-out +deployment. After a timeout where the site actually applied the config, a +re-deploy produces a second deployment with no reconciliation against the +site's current revision hash. Site-side stale-rejection is the only safety +net, and that is not verified here. + +**Recommendation** + +Add a site query (a new `CommunicationService` pattern returning the site's +currently-applied deployment ID / revision hash) and call it before re-deploy +when a prior record for the instance is in `InProgress`/`Failed` due to +timeout. Reconcile: if the site already has the target revision, mark the prior +record `Success` instead of re-sending. Either implement this or update the +design doc to reflect that reconciliation is delegated entirely to site-side +stale-rejection. + +**Resolution** + +_Unresolved._ + +### DeploymentManager-007 — "Diff View" reduced to a hash comparison with no diff detail + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:334-358,401-406` | + +**Description** + +The design ("Diff View" and "Dependencies" sections) states the Deployment +Manager can request a diff from the Template Engine showing added/removed +members, changed values, and connection-binding changes. +`GetDeploymentComparisonAsync` and `DeploymentComparisonResult` only compare two +revision hashes and return a boolean `IsStale` plus the two hashes. No +added/removed/changed detail is produced, and the Template Engine's diff +capability is not invoked. The UI cannot render a meaningful diff from this +result. + +**Recommendation** + +Either implement a real diff (deserialize the stored +`DeployedConfigSnapshot.ConfigurationJson` and the freshly flattened config and +invoke the Template Engine's diff service, surfacing structured +added/removed/changed entries), or revise the design doc to scope the feature +down to staleness detection only. + +**Resolution** + +_Unresolved._ + +### DeploymentManager-008 — `DeploymentManagerOptions` is never bound to configuration + +| | | +|--|--| +| Severity | Medium | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.DeploymentManager/ServiceCollectionExtensions.cs:7-14` | + +**Description** + +`AddDeploymentManager` registers the services but never calls +`services.Configure(configuration.GetSection(...))`. +`IOptions` therefore always resolves to a +default-constructed instance — the operation-lock and artifact-deployment +timeouts cannot be tuned via `appsettings.json`, contrary to the CLAUDE.md +convention "Per-component configuration via `appsettings.json` sections bound +to options classes (Options pattern)." `Host/Program.cs` binds +`SecurityOptions` and `InboundApiOptions` from configuration sections but has +no equivalent for `DeploymentManagerOptions`. + +**Recommendation** + +Add an `IConfiguration` parameter (or a configure callback) to +`AddDeploymentManager` and bind `DeploymentManagerOptions` to a section such as +`ScadaLink:DeploymentManager`, consistent with the other components. + +**Resolution** + +_Unresolved._ + +### DeploymentManager-009 — Misleading timeout comment on `DeleteInstanceAsync` + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:288` | + +**Description** + +The XML doc says "Delete fails if site unreachable (30s timeout via +CommunicationOptions)." The actual delete timeout is whatever +`CommunicationOptions.LifecycleTimeout` is configured to (passed inside +`CommunicationService.DeleteInstanceAsync`); the "30s" figure is hard-coded +into the comment and not derived from any constant in this module. If +`LifecycleTimeout` is reconfigured, the comment becomes wrong. It also wrongly +implies the value lives in this module. + +**Recommendation** + +Reword to "Delete fails if the site is unreachable within +`CommunicationOptions.LifecycleTimeout`" without quoting a specific number. + +**Resolution** + +_Unresolved._ + +### DeploymentManager-010 — `SystemArtifactDeploymentRecord` does not persist the deployment ID + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:136,194-211` | + +**Description** + +`DeployToAllSitesAsync` generates a `deploymentId` (line 136) and returns it in +the `ArtifactDeploymentSummary` and audit log, but the persisted +`SystemArtifactDeploymentRecord` has no field for it (the entity only has `Id`, +`ArtifactType`, `DeployedBy`, `DeployedAt`, `PerSiteStatus`). The deployment ID +that appears in the UI summary and audit log cannot be correlated back to the +stored record. Additionally each per-site `DeployArtifactsCommand` carries its +own separate GUID (`BuildDeployArtifactsCommandAsync` line 114), so there are in +fact N+1 unrelated IDs for one logical artifact deployment. + +**Recommendation** + +Add a `DeploymentId` column to `SystemArtifactDeploymentRecord` and store the +single logical `deploymentId`; reuse that ID (or a derived per-site ID) for the +per-site commands so the audit log, UI summary, and persisted record agree. + +**Resolution** + +_Unresolved._ + +### DeploymentManager-011 — Tests never exercise a successful deployment or lifecycle success path + +| | | +|--|--| +| Severity | Medium | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:100-151,155-199` | + +**Description** + +`DeploymentServiceTests` never sets the `CommunicationService` actor, so every +deploy/lifecycle test deliberately stops at the `InvalidOperationException` +thrown by `GetCommunicationActor()` (see lines 118-125, 147). As a result there +is no test covering: a successful deployment (`DeploymentStatus.Success` +response → instance state set to `Enabled`, snapshot stored, audit logged); a +failed-but-handled site response; the `InProgress`-stuck bug +(DeploymentManager-001); successful Disable/Enable/Delete; or the operation +lock actually serializing two concurrent deploys of the same instance. The +critical post-response branch (`DeploymentService.cs:154-184`) and the entire +delete/disable/enable success path are untested. The `AuditLogs` test +(lines 277-289) asserts nothing. + +**Recommendation** + +Introduce a seam to inject a fake/substitute communication path (e.g. an +interface over `CommunicationService`, or wire a TestKit actor) so success and +handled-failure paths can be unit tested. Add tests for the stuck-`InProgress` +scenario and for per-instance lock contention during deploy. Make the audit +test assert on `IAuditService.LogAsync`. + +**Resolution** + +_Unresolved._ + +### DeploymentManager-012 — `LifecycleCommandTimeout` option is dead code + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.DeploymentManager/DeploymentManagerOptions.cs:8-9` | + +**Description** + +`DeploymentManagerOptions.LifecycleCommandTimeout` is declared with a 30s +default and an XML doc, but it is never read anywhere in the codebase +(lifecycle commands rely on `CommunicationOptions.LifecycleTimeout` inside +`CommunicationService`). The option misleads readers into thinking it controls +disable/enable/delete timeouts, when setting it has no effect. + +**Recommendation** + +Remove `LifecycleCommandTimeout`, or actually thread it through to the +lifecycle command calls (e.g. by creating a linked CTS with this timeout in +`DisableInstanceAsync`/`EnableInstanceAsync`/`DeleteInstanceAsync`, the way +`ArtifactDeploymentTimeoutPerSite` is used). + +**Resolution** + +_Unresolved._ + +### DeploymentManager-013 — SMTP credentials serialized and broadcast to all sites + +| | | +|--|--| +| Severity | Low | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:108-111` | + +**Description** + +`BuildDeployArtifactsCommandAsync` maps `smtp.Credentials` directly into +`SmtpConfigurationArtifact` and that command is sent to every site. Distributing +SMTP credentials to sites is consistent with the design (SMTP configuration is +a deployable artifact), but the credentials travel inside a serialized command +across the inter-cluster transport and are stored on each site's SQLite. There +is no indication the value is encrypted at rest on the site or scrubbed from +logs. Worth confirming the transport is TLS-protected and the site stores the +credential securely; at minimum this should be a conscious, documented decision. + +**Recommendation** + +Confirm inter-cluster transport encryption covers artifact commands, ensure +`Credentials` is never written to logs, and document the at-rest protection of +SMTP credentials on site SQLite. Consider encrypting the credential field +within the artifact payload. + +**Resolution** + +_Unresolved._ + +### DeploymentManager-014 — Dead `CreateCommand` helper in artifact tests + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:86-90` | + +**Description** + +The private static `CreateCommand()` helper is never referenced by any test in +the file. It is dead code that suggests an intended test (e.g. a successful +multi-site artifact deployment) was never written — coverage of +`DeployToAllSitesAsync` is limited to the no-sites failure case, and +`RetryForSiteAsync` and `BuildDeployArtifactsCommandAsync` have no tests at all. + +**Recommendation** + +Either remove the unused helper or, preferably, write the missing tests for +`DeployToAllSitesAsync` (per-site success/failure matrix, partial failure) and +`RetryForSiteAsync` using it. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/ExternalSystemGateway/findings.md b/code-reviews/ExternalSystemGateway/findings.md new file mode 100644 index 0000000..e8e90a2 --- /dev/null +++ b/code-reviews/ExternalSystemGateway/findings.md @@ -0,0 +1,512 @@ +# Code Review — ExternalSystemGateway + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.ExternalSystemGateway` | +| Design doc | `docs/requirements/Component-ExternalSystemGateway.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 14 | + +## Summary + +The External System Gateway is a small module (five source files plus options) that +implements the HTTP/REST client (`ExternalSystemClient`), the database access surface +(`DatabaseGateway`), and error classification (`ErrorClassifier`). The structure is +clean and the dual call-mode semantics broadly match the design doc. However, the +review surfaced several substantive problems that prevent the module from behaving as +designed. The most serious is that **no store-and-forward delivery handler is ever +registered** for the `ExternalSystem` or `CachedDbWrite` categories, so cached calls +and cached writes are buffered but can never actually be delivered on retry — a silent +data-loss path. Two further high-impact issues are that the **per-system call timeout +is never applied** to the HTTP client (the design's central error-handling guarantee +is absent), and that **`CachedCall` double-dispatches the HTTP request** because +`StoreAndForwardService.EnqueueAsync` itself re-attempts immediate delivery, breaking +the idempotency expectations. A cluster of medium issues concern resource leaks, +classification gaps (cancellation conflation), and the dropped `StoreAndForwardResult`. +Test coverage is thin — `CachedCall` transient/buffering paths and `DatabaseGateway` +are entirely untested. Themes: incomplete wiring against the S&F engine, and design-doc +requirements (timeout, retry settings) that are declared but not implemented. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | URL building edge cases, dropped S&F result, classification gaps — findings 003, 006, 009. | +| 2 | Akka.NET conventions | ☑ | No actors in this module; `AddExternalSystemGatewayActors` is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found in this module. | +| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; `ExternalCallResult.Response` lazy-parse is not thread-safe but instances are single-use. No findings raised. | +| 4 | Error handling & resilience | ☑ | S&F handler never registered, double-dispatch, timeout not applied, cancellation conflation — findings 001, 002, 003, 008. | +| 5 | Security | ☑ | Auth secrets logged-safe, but error bodies echoed verbatim — finding 007. | +| 6 | Performance & resource management | ☑ | `HttpRequestMessage`/`HttpResponseMessage` and failed `SqlConnection` not disposed; full repository scan per call — findings 005, 010, 011. | +| 7 | Design-document adherence | ☑ | Timeout, retry settings, audit logging gaps — findings 002, 004, 012. | +| 8 | Code organization & conventions | ☑ | Options class correctly owned by module; `MaxConcurrentConnectionsPerSystem` unused — finding 013. | +| 9 | Testing coverage | ☑ | CachedCall buffering and DatabaseGateway untested — finding 014. | +| 10 | Documentation & comments | ☑ | XML docs reference WP numbers; permanent-failure logging requirement unverified — folded into finding 012. | + +## Findings + +### ExternalSystemGateway-001 — No S&F delivery handler registered; cached calls and writes can never be delivered + +| | | +|--|--| +| Severity | Critical | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:81` | + +**Description** + +`CachedCallAsync` and `CachedWriteAsync` enqueue messages under +`StoreAndForwardCategory.ExternalSystem` and `StoreAndForwardCategory.CachedDbWrite`. +`StoreAndForwardService.RegisterDeliveryHandler` is the only mechanism that lets the +S&F engine actually deliver a buffered message, and a repository-wide search shows it +is **never called for either category** anywhere in the codebase. Consequences: + +1. On a transient failure, `EnqueueAsync` falls through to the "No handler registered + — buffer for later" branch (`StoreAndForwardService.cs:163`) and the message is + persisted. +2. During the retry sweep, `AttemptDeliveryAsync` (`StoreAndForwardService.cs:201`) + logs `"No delivery handler for category {Category}"` and returns without ever + removing or delivering the message. + +The result is that every cached external call and cached DB write is silently +buffered forever and never delivered — a data-loss path for the exact "deferred +delivery is acceptable" use cases the design doc calls out (posting production data, +quality reports). The script also receives `WasBuffered: true` / a successful +`CachedWriteAsync` completion, so the failure is completely invisible. + +**Recommendation** + +Register delivery handlers for `StoreAndForwardCategory.ExternalSystem` and +`StoreAndForwardCategory.CachedDbWrite` during host/site startup. The `ExternalSystem` +handler should deserialize the payload, re-resolve the system/method, and re-invoke +`InvokeHttpAsync`, returning `true`/`false`/throwing per the transient-vs-permanent +contract `EnqueueAsync` expects. The `CachedDbWrite` handler should execute the SQL +against the named connection. Add an integration test that buffers a message and +verifies it is delivered by a retry sweep. + +**Resolution** + +_Unresolved._ + +### ExternalSystemGateway-002 — Per-system call timeout is never applied to HTTP requests + +| | | +|--|--| +| Severity | High | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:130`, `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13` | + +**Description** + +The design doc states each external system definition specifies a timeout that +"applies to all method calls on that system" and "applies to the HTTP request +round-trip", and `ExternalSystemGatewayOptions.DefaultHttpTimeout` exists as a +fallback. In practice no timeout is ever configured. `ServiceCollectionExtensions` +calls `services.AddHttpClient()` with no per-named-client configuration, and +`InvokeHttpAsync` calls `_httpClientFactory.CreateClient($"ExternalSystem_{system.Name}")` +without setting `client.Timeout` or passing a `CancellationToken` derived from a +timeout. `SendAsync` is therefore subject only to `HttpClient`'s default 100-second +timeout, regardless of the system definition or the configured `DefaultHttpTimeout`. +A slow or hung external system will block the calling Script Execution Actor far +longer than the operator configured, and the design's core error-handling guarantee +(timeout → transient classification) does not hold within the intended window. + +There is also no `Timeout` field on `ExternalSystemDefinition` at all, so even a +correct implementation has nowhere to read the per-system value from — the entity is +missing the field the design requires. + +**Recommendation** + +Add a `Timeout` (TimeSpan) field to `ExternalSystemDefinition` and have +`InvokeHttpAsync` enforce it — either by setting `client.Timeout` via a typed/named +`HttpClient` registration, or by linking a `CancellationTokenSource` with the +per-system (or `DefaultHttpTimeout`) timeout to the supplied `cancellationToken` +before `SendAsync`. Ensure the resulting `TaskCanceledException`/`TimeoutException` +is classified as transient. + +**Resolution** + +_Unresolved._ + +### ExternalSystemGateway-003 — `CachedCall` double-dispatches the HTTP request + +| | | +|--|--| +| Severity | High | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:84-117` | + +**Description** + +`CachedCallAsync` first calls `InvokeHttpAsync` directly (line 86). On a +`TransientExternalSystemException` it then calls `_storeAndForward.EnqueueAsync(...)` +(line 109). `StoreAndForwardService.EnqueueAsync` is **not** a pure enqueue — it +"Attempts immediate delivery" by invoking the registered delivery handler +(`StoreAndForwardService.cs:128-159`). If a delivery handler for the `ExternalSystem` +category is registered (as finding 001 recommends), the HTTP request will be executed +a **second time** synchronously inside `EnqueueAsync`, immediately after the first +attempt failed. For a transient failure that is actually a slow/overloaded system, +this doubles the load and — critically — if the original request did reach the +external system, the immediate retry produces a duplicate delivery before the script +even returns, worsening the idempotency hazard the design doc explicitly warns about. + +**Recommendation** + +Decide on one dispatch path. Either (a) have `CachedCall` not pre-invoke +`InvokeHttpAsync` and instead let `EnqueueAsync`'s immediate-delivery attempt be the +single first attempt (requires the handler to exist and to surface permanent vs +transient correctly); or (b) add an enqueue-only entry point to +`StoreAndForwardService` that skips the immediate-delivery attempt, and have +`CachedCall` use it after its own first attempt. Approach (a) is cleaner and removes +the duplicated logic. + +**Resolution** + +_Unresolved._ + +### ExternalSystemGateway-004 — System retry settings are not honoured for cached calls/writes + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:114-115`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:86-87` | + +**Description** + +`CachedCallAsync` and `CachedWriteAsync` pass the definition's `MaxRetries` / +`RetryDelay` to `EnqueueAsync` only when they are non-default +(`MaxRetries > 0 ? ... : null`, `RetryDelay > TimeSpan.Zero ? ... : null`), otherwise +falling back to the S&F defaults. The site-side repository that supplies these +definitions, `SiteExternalSystemRepository.MapExternalSystem` +(`src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs:194`), never +reads `MaxRetries`/`RetryDelay` from SQLite at all — the constructed entities always +have `MaxRetries == 0` and `RetryDelay == TimeSpan.Zero`. As a result, at sites the +per-system retry settings the design doc requires are *always* discarded and the +global S&F defaults are silently used instead. The `> 0` guard in the ESG also makes +a legitimately-configured `MaxRetries` of 0 ("never retry") indistinguishable from +"unset", so an operator cannot express "do not retry". + +**Recommendation** + +Within this module, drop the `> 0` / `> Zero` guards and pass the definition values +through directly (or use nullable fields on the entity to distinguish "unset"). The +companion fix in `SiteExternalSystemRepository` to actually map the retry columns +should be tracked against the SiteRuntime module. + +**Resolution** + +_Unresolved._ + +### ExternalSystemGateway-005 — `HttpRequestMessage` and `HttpResponseMessage` are not disposed + +| | | +|--|--| +| Severity | Medium | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:133-167` | + +**Description** + +`InvokeHttpAsync` creates an `HttpRequestMessage` (line 133) and receives an +`HttpResponseMessage` from `SendAsync` (line 155); neither is wrapped in a `using` nor +explicitly disposed. Both are `IDisposable` and own resources (the request's +`StringContent`, the response's content stream). Under the per-invocation call volume +of a busy site this produces avoidable pressure on the finalizer queue and can hold +socket/stream resources longer than necessary. The success path reads the content but +never disposes the response; the error path likewise reads `errorBody` and then throws +without disposing. + +**Recommendation** + +Wrap the request in `using var request = ...` and the response in +`using var response = ...` (or call `Dispose()` in a `finally`). Ensure disposal still +occurs on the exception paths. + +**Resolution** + +_Unresolved._ + +### ExternalSystemGateway-006 — `BuildUrl` ignores path templates and appends a trailing slash for empty paths + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:180-196` | + +**Description** + +`BuildUrl` does `baseUrl.TrimEnd('/') + "/" + path.TrimStart('/')`. When `method.Path` +is empty (a method that targets the base URL itself), this still appends a `/`, +producing `https://host/api/` which some servers treat as a different resource than +`https://host/api`. More importantly, the design doc shows method paths as templates +like `/recipes/{id}`, but `BuildUrl` performs no placeholder substitution — a `{id}` +token is sent literally in the URL and the corresponding parameter is instead appended +as a query-string entry (for GET/DELETE) or placed in the JSON body (POST/PUT). Either +the design's path-template feature is unimplemented, or the doc is stale; in the +current code a method defined as `/recipes/{id}` will never produce a correct URL. + +**Recommendation** + +Decide whether path templating is in scope. If yes, implement `{name}` substitution +from `parameters` in `BuildUrl` and exclude substituted parameters from the query +string/body. If no, update the component design doc to remove the `/recipes/{id}` +example and state that paths are literal. Also avoid appending a trailing `/` when +`path` is empty. + +**Resolution** + +_Unresolved._ + +### ExternalSystemGateway-007 — External error response bodies are echoed verbatim into script-visible error messages + +| | | +|--|--| +| Severity | Medium | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:167-177` | + +**Description** + +On a non-success HTTP response, the full response body is read into `errorBody` and +embedded verbatim into the exception message (`$"HTTP {code} from {name}: {errorBody}"`), +which then flows into `ExternalCallResult.ErrorMessage` and back to the calling script, +and into Site Event Logging. An external system error page can be arbitrarily large +(an HTML stack trace, a multi-megabyte body) and may contain sensitive detail. There +is no size cap, so a hostile or misbehaving endpoint can inflate every error log entry +and error string returned to scripts. There is also no content-type check before +treating the body as text. + +**Recommendation** + +Truncate `errorBody` to a bounded length (e.g. 1–2 KB) before embedding it, and +consider logging the full body separately at debug level rather than returning it to +the script. Optionally only include the body when the content type is textual. + +**Resolution** + +_Unresolved._ + +### ExternalSystemGateway-008 — Cancellation is conflated with transient timeout failure + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.ExternalSystemGateway/ErrorClassifier.cs:24-30`, `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:157-159` | + +**Description** + +`ErrorClassifier.IsTransient(Exception)` returns `true` for `TaskCanceledException` +and `OperationCanceledException`. `HttpClient.SendAsync` throws `TaskCanceledException` +both when its internal timeout elapses *and* when the supplied `CancellationToken` is +cancelled (e.g. the Script Execution Actor is stopped, or the actor system is shutting +down). Because `InvokeHttpAsync`'s `catch` filter treats all of these as transient, a +caller-initiated cancellation during a `CachedCall` will be misclassified as a +transient failure and the message will be buffered for retry — work the caller +explicitly asked to abandon. For a `Call`, a shutdown-time cancellation is reported to +the script as a "Transient error" rather than an `OperationCanceledException`. + +**Recommendation** + +In `InvokeHttpAsync`, check `cancellationToken.IsCancellationRequested` first and +rethrow `OperationCanceledException` (or let it propagate) before applying transient +classification. Only treat a cancellation as a timeout when the supplied token is +*not* the one that was cancelled. + +**Resolution** + +_Unresolved._ + +### ExternalSystemGateway-009 — `StoreAndForwardResult` from `EnqueueAsync` is discarded; permanent failures during buffering are swallowed + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109-117` | + +**Description** + +`CachedCallAsync` assigns the result of `_storeAndForward.EnqueueAsync(...)` to +`sfResult` and then never reads it — it unconditionally returns +`new ExternalCallResult(true, null, null, WasBuffered: true)`. `EnqueueAsync` can +return `Success == false` (a permanent failure encountered during its +immediate-delivery attempt — `StoreAndForwardService.cs:142`) or `Buffered == false` +(delivered immediately). In both cases the ESG still reports the call as buffered and +successful to the script. A permanent failure surfaced by the S&F immediate attempt is +therefore silently lost instead of being returned to the script as the design requires +("On permanent failure (HTTP 4xx), the error is returned synchronously"). + +**Recommendation** + +Inspect `sfResult`: if `Success == false` return an error `ExternalCallResult`; set +`WasBuffered` from `sfResult.Buffered` rather than hard-coding `true`. (This finding is +partly subsumed by the dispatch redesign in finding 003.) + +**Resolution** + +_Unresolved._ + +### ExternalSystemGateway-010 — `GetConnectionAsync` leaks the `SqlConnection` when `OpenAsync` fails + +| | | +|--|--| +| Severity | Medium | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:48-50` | + +**Description** + +`GetConnectionAsync` constructs `new SqlConnection(...)` and calls `await +connection.OpenAsync(...)`. If `OpenAsync` throws (unreachable server, bad +credentials, cancellation) the just-created `SqlConnection` instance is never disposed +— the exception propagates and the local reference is lost. While an unopened +`SqlConnection` is lightweight, over many failing calls this is an avoidable leak. The +design doc says `Database.Connection()` failures return an error to the script; the +current code lets a raw `SqlException` escape, which is acceptable, but the leak is +not. + +**Recommendation** + +Wrap the open in a try/catch that disposes the connection before rethrowing: +`try { await connection.OpenAsync(ct); } catch { connection.Dispose(); throw; }`. + +**Resolution** + +_Unresolved._ + +### ExternalSystemGateway-011 — Every call performs a full repository scan of all systems and methods + +| | | +|--|--| +| Severity | Low | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:231-245`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:90-97` | + +**Description** + +`ResolveSystemAndMethodAsync` calls `GetAllExternalSystemsAsync()` and then +`GetMethodsByExternalSystemIdAsync()` and filters in memory on every single call; +`ResolveConnectionAsync` calls `GetAllDatabaseConnectionsAsync()` and filters in memory +on every cached write / connection request. At sites this hits the SQLite repository, +and `SiteExternalSystemRepository` re-reads and re-parses the methods JSON each time. +For a hot script path this is unnecessary repeated I/O and allocation. Definitions only +change on deployment, so they are eminently cacheable. + +**Recommendation** + +Add an in-memory cache of system/method/connection definitions keyed by name, +invalidated on artifact deployment. Alternatively use a name-keyed repository lookup +rather than fetch-all-then-filter. + +**Resolution** + +_Unresolved._ + +### ExternalSystemGateway-012 — Permanent-failure logging requirement is not met; `_logger` is injected but unused + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:24,169-177`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:22` | + +**Description** + +The design doc states permanent failures are "Logged to Site Event Logging", but +`InvokeHttpAsync` performs no logging on the permanent-failure path. In fact the +injected `ILogger` and `ILogger` fields are +never used at all in either class. Either the logging is expected to happen in the +caller (Script Execution Actor) — in which case the design doc is imprecise about +where — or it is missing. Separately, `IsTransient(HttpStatusCode)` treats any +non-success, non-(5xx/408/429) status as permanent without an explicit comment, which +is a reasonable default but undocumented. + +**Recommendation** + +Add a `_logger.LogWarning` on the permanent-failure path (and a debug log on +transient), or clarify in the design doc that Site Event Logging capture is the +caller's responsibility and remove the unused `_logger` fields. Add a comment in +`ErrorClassifier` documenting the "default to permanent" behaviour. + +**Resolution** + +_Unresolved._ + +### ExternalSystemGateway-013 — `MaxConcurrentConnectionsPerSystem` and `DefaultHttpTimeout` options are defined but never used + +| | | +|--|--| +| Severity | Low | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemGatewayOptions.cs:9,12`, `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13` | + +**Description** + +`ExternalSystemGatewayOptions.MaxConcurrentConnectionsPerSystem` (default 10) and +`DefaultHttpTimeout` (default 30s) are bound from configuration but neither is read +anywhere. `AddHttpClient()` registers the default factory with no +`ConfigurePrimaryHttpMessageHandler`/`SocketsHttpHandler` `MaxConnectionsPerServer` and +no `Timeout`, so both options have no effect. An operator setting these values gets +them silently ignored — a misleading configuration surface (`DefaultHttpTimeout` is +also referenced by finding 002). + +**Recommendation** + +Either wire the options into a named/typed `HttpClient` registration (set +`MaxConnectionsPerServer` on the primary handler, set `Timeout`), or remove the unused +options to avoid implying behaviour that does not exist. + +**Resolution** + +_Unresolved._ + +### ExternalSystemGateway-014 — Cached-call buffering path and `DatabaseGateway` are untested + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.ExternalSystemGateway.Tests/ExternalSystemClientTests.cs:1`, (no `DatabaseGatewayTests.cs`) | + +**Description** + +`ExternalSystemClientTests` covers system/method not-found, success, transient 500 and +permanent 400 for `CallAsync`, plus `CachedCall` not-found and success. It does **not** +cover: the `CachedCall` transient-failure → S&F buffering branch (the most +behaviour-rich path, including the `_storeAndForward == null` fallback and `WasBuffered` +semantics), the `CachedCall` permanent-failure branch, connection-exception +classification (`HttpRequestException` thrown by the handler), `BuildUrl` query-string +construction, and `ApplyAuth` for the apikey/basic variants. There is **no test file +for `DatabaseGateway`** at all — `GetConnectionAsync` not-found, `CachedWriteAsync` +not-found, and the `_storeAndForward == null` guard are entirely uncovered. The +`MockHttpMessageHandler` also does not assert request URL/headers/body, so auth and +URL construction are unverified. + +**Recommendation** + +Add tests for the `CachedCall` transient/buffering paths (with a substituted S&F +service), `DatabaseGateway` not-found and null-S&F guards, and `BuildUrl`/`ApplyAuth` +by asserting on the captured `HttpRequestMessage` in the mock handler. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/HealthMonitoring/findings.md b/code-reviews/HealthMonitoring/findings.md new file mode 100644 index 0000000..516646e --- /dev/null +++ b/code-reviews/HealthMonitoring/findings.md @@ -0,0 +1,420 @@ +# Code Review — HealthMonitoring + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.HealthMonitoring` | +| Design doc | `docs/requirements/Component-HealthMonitoring.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 12 | + +## Summary + +The HealthMonitoring module is small, readable, and broadly faithful to the design +intent: per-interval error counters with atomic read-and-reset, monotonic sequence +numbers with Unix-ms seeding to survive failover, sequence-guarded staleness +rejection, and a 60s offline timeout. However, the review surfaced two recurring +themes. First, **a documented metric is silently unimplemented** — store-and-forward +buffer depths are never populated (`SetStoreAndForwardDepths` has zero callers and a +test asserts the field is always empty), so the dashboard cannot show the buffer +depth metric the design doc requires. Second, **the central aggregator's in-memory +state model has unguarded shared mutable state**: `SiteHealthState` is a mutable +class whose fields are written by a background timer thread, by `ProcessReport`, and +by `MarkHeartbeat` with no synchronization, and the same live mutable objects are +handed straight to UI callers via `GetAllSiteStates`. The `ProcessReport` logic also +mutates shared state inside a `ConcurrentDictionary.AddOrUpdate` update delegate, +which the runtime may invoke more than once under contention. Additionally there are +gaps around central self-report offline detection, heartbeats for not-yet-registered +sites being dropped, and missing test coverage for the central report loop, +heartbeat path, and most collector setters. None of the findings are crash-class, +but the concurrency issues are Medium/High and the missing S&F metric is a real +design-adherence gap. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). | +| 2 | Akka.NET conventions | x | Module itself contains no actors (transport abstracted via `IHealthReportTransport`); `AddHealthMonitoringActors` is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope. | +| 3 | Concurrency & thread safety | x | Unguarded mutable `SiteHealthState` (HealthMonitoring-002); mutation inside `AddOrUpdate` delegate (HealthMonitoring-003); `GetAllSiteStates` leaks live mutable references (HealthMonitoring-008). Collector counters correctly use `Interlocked`. | +| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010); top-level loop error handling is sound. | +| 5 | Security | x | No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface. | +| 6 | Performance & resource management | x | `PeriodicTimer` instances correctly disposed via `using`. Dictionary snapshots per report are acceptable at the documented scale. No issues found. | +| 7 | Design-document adherence | x | Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006). | +| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011). | +| 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). | +| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012). | + +## Findings + +### HealthMonitoring-001 — Store-and-forward buffer depth metric is never populated + +| | | +|--|--| +| Severity | High | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:104`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:79` | + +**Description** + +`Component-HealthMonitoring.md` lists "Store-and-forward buffer depth" (pending +messages by category) as a required monitored metric. `SiteHealthCollector` exposes +`SetStoreAndForwardDepths(...)` to receive it, but a codebase-wide search shows the +method has **no callers** — `_sfBufferDepths` always remains the empty dictionary it +is initialized to. `HealthReportSender` queries `GetParkedMessageCountAsync()` and +sets `ParkedMessageCount`, but parked count is a distinct metric from per-category +buffer depth. The test `SiteHealthCollectorTests.StoreAndForwardBufferDepths_IsEmptyPlaceholder` +even codifies the unimplemented state as expected behaviour. The result is that the +central dashboard cannot display buffer depth, a documented triage metric. + +**Recommendation** + +Wire `SetStoreAndForwardDepths` into `HealthReportSender.ExecuteAsync` (alongside the +existing parked-count call) using the S&F engine's per-category depth API, or, if the +metric is intentionally deferred, record that decision in the design doc and remove +the dead setter. Update the placeholder test accordingly once implemented. + +**Resolution** + +_Unresolved._ + +### HealthMonitoring-002 — `SiteHealthState` mutable fields written from multiple threads without synchronization + +| | | +|--|--| +| Severity | High | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:137` | + +**Description** + +`SiteHealthState` is a plain mutable class. Its fields (`LatestReport`, +`LastReportReceivedAt`, `LastHeartbeatAt`, `LastSequenceNumber`, `IsOnline`) are +mutated from at least three concurrent contexts: `ProcessReport` (caller thread — +ClusterClient/PubSub message handlers), `MarkHeartbeat` (caller thread — heartbeat +handler), and `CheckForOfflineSites` (the `BackgroundService` timer thread). The +`ConcurrentDictionary` only protects the dictionary structure, not the objects it +stores. A heartbeat update and the offline-check can interleave on the same +`SiteHealthState` instance, and reads/writes of `DateTimeOffset` (a 16-byte struct) +and `long` fields are not guaranteed atomic on all platforms — producing torn reads +and lost updates of `IsOnline`/`LastHeartbeatAt`. + +**Recommendation** + +Make state transitions atomic: either guard all reads/writes of a `SiteHealthState` +with a per-site lock, or replace `SiteHealthState` with an immutable record updated +via `ConcurrentDictionary` compare-and-swap (`TryUpdate`) so every transition is +a single atomic reference swap. + +**Resolution** + +_Unresolved._ + +### HealthMonitoring-003 — Shared state mutated inside `ConcurrentDictionary.AddOrUpdate` update delegate + +| | | +|--|--| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:55-78` | + +**Description** + +The update delegate passed to `AddOrUpdate` mutates the `existing` object in place +(`existing.LatestReport = report; existing.IsOnline = true; ...`). `AddOrUpdate`'s +contract explicitly allows the update delegate to be invoked **more than once** under +contention (when the CAS that installs the result loses a race and is retried). Each +invocation mutates the shared object, so a concurrent report for the same site can +observe a half-applied update, and the multi-field assignment is not atomic with +respect to readers in `GetAllSiteStates`/`CheckForOfflineSites`. The intended +"only replace if sequence is higher" guard can also be subverted because the +sequence comparison and the field writes are not a single atomic step. + +**Recommendation** + +Have the update delegate return a **new** `SiteHealthState` (record `with` copy) +rather than mutating `existing`, and treat the dictionary value as immutable. +Combined with HealthMonitoring-002, this makes every state transition an atomic +reference swap with no observable intermediate state. + +**Resolution** + +_Unresolved._ + +### HealthMonitoring-004 — Inconsistent heartbeat interval described across XML docs + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:146-148`, `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:21`, `src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs:16` | + +**Description** + +The heartbeat cadence that offline detection relies on is documented inconsistently. +`CheckForOfflineSites` says "heartbeats arrive every ~5s"; `SiteHealthState.LastHeartbeatAt` +says "~5s heartbeat"; but `ICentralHealthAggregator.MarkHeartbeat` says "~2s +heartbeats are arriving". The actual cadence is set elsewhere (Cluster Infrastructure / +`SiteCommunicationActor`). Readers cannot reason about whether a 60s offline timeout +gives the intended grace without a single authoritative number. + +**Recommendation** + +Pick the correct interval (verify against the heartbeat scheduler in +`SiteCommunicationActor`/Cluster Infrastructure) and use it consistently in all three +comments, ideally referencing the owning component rather than restating a magic number. + +**Resolution** + +_Unresolved._ + +### HealthMonitoring-005 — Central self-report site can flap offline; no heartbeat grace like real sites + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:48-81`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:149` | + +**Description** + +`CheckForOfflineSites` decides offline status purely from `LastHeartbeatAt`, and for +real sites that field is kept fresh by frequent (~2-5s) heartbeats so the 60s timeout +only fires on genuine total loss. The synthetic `central` site, however, has no +heartbeat source — `LastHeartbeatAt` is only bumped by `ProcessReport` from the +30s `CentralHealthReportLoop`. The loop also only runs on the cluster leader and +silently skips a cycle on any exception. Consequently, a single skipped/late central +self-report (leader GC pause, brief stall, mid-failover before the new leader's loop +spins up) leaves `central` with no signal for >60s and it is marked offline even +though the central cluster is healthy. The central card thus has no equivalent of +the "one missed report grace" the design doc grants real sites. + +**Recommendation** + +Either feed `central` a heartbeat equivalent (e.g. have `MarkHeartbeat` called for +`CentralSiteId` on a fast timer independent of the leader-only report loop), or apply +a longer/distinct offline timeout to the `central` keyspace entry, and ensure the new +leader starts the report loop promptly on failover. + +**Resolution** + +_Unresolved._ + +### HealthMonitoring-006 — Sequence seeding contradicts the doc's "starting at 1" wording and is untestable + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:28`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:32` | + +**Description** + +The `HealthReportSender` class XML summary states "Sequence numbers are monotonic, +starting at 1, and reset on service restart." The implementation instead seeds +`_sequenceNumber` with `DateTimeOffset.UtcNow.ToUnixTimeMilliseconds()` so the first +emitted sequence is a large epoch value, specifically to keep ordering correct across +failover. The summary is therefore stale and contradicts the code. Separately, the +seed reads `DateTimeOffset.UtcNow` directly at field initialization rather than +through an injected `TimeProvider` (which `CentralHealthAggregator` already uses), +making the seeding logic impossible to unit-test deterministically and dependent on +node wall-clock agreement — if one node's clock lags, its post-failover reports can +be silently rejected as stale by the aggregator. + +**Recommendation** + +Fix the `HealthReportSender` XML summary to describe the actual Unix-ms seeding +strategy, and inject `TimeProvider` for the seed so the behaviour is testable and the +clock dependency is explicit. + +**Resolution** + +_Unresolved._ + +### HealthMonitoring-007 — Heartbeats for not-yet-registered sites are silently dropped + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86-99` | + +**Description** + +`MarkHeartbeat` returns immediately if the site is not already in `_siteStates` +("registration only happens on report"). Central health state is in-memory only and +not persisted. After a central restart or failover the aggregator starts empty, so +for up to one full report interval (default 30s) every site emits only heartbeats +that are all discarded — the site is reported as *unknown* (absent from +`GetAllSiteStates`) rather than *online*, even though heartbeats prove it is +reachable. This is a visible dashboard regression precisely during the failover +window, which is when operators most need accurate status. + +**Recommendation** + +Allow `MarkHeartbeat` to register a minimal `SiteHealthState` (online, no +`LatestReport` yet, with a UI-visible "awaiting first report" indication) when a +heartbeat arrives for an unknown site, so reachable sites show online immediately +after a central restart. + +**Resolution** + +_Unresolved._ + +### HealthMonitoring-008 — `GetAllSiteStates` / `GetSiteState` leak live mutable state objects to callers + +| | | +|--|--| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:104-116` | + +**Description** + +`GetAllSiteStates` copies the dictionary but the copy still holds references to the +same live mutable `SiteHealthState` instances; `GetSiteState` returns the live +instance directly. UI consumers (Blazor Server / SignalR circuits) read these objects +on their own threads while the aggregator's background timer and report handlers +concurrently mutate the very same instances (see HealthMonitoring-002). A UI render +can observe a `SiteHealthState` with, e.g., `IsOnline == true` but a `LatestReport` +from a different update, or a torn `DateTimeOffset`. Callers could also mutate the +shared state, corrupting aggregator state. + +**Recommendation** + +Return immutable snapshots: convert `SiteHealthState` to a record (per +HealthMonitoring-002/003) so handing out the reference is safe, or deep-copy each +state into an immutable DTO before returning. + +**Resolution** + +_Unresolved._ + +### HealthMonitoring-009 — Missing test coverage for central report loop, heartbeat path, replication, and collector setters + +| | | +|--|--| +| Severity | Medium | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.HealthMonitoring.Tests/` | + +**Description** + +Several behaviours have no automated coverage: +- `CentralHealthReportLoop` — leader-only gating (`SelfIsPrimary`), self-report + generation, sequence assignment: no test file at all. +- `CentralHealthAggregator.MarkHeartbeat` — keeping a site online between reports, + online recovery via heartbeat, and the unknown-site drop behaviour + (HealthMonitoring-007): untested. +- Offline detection driven by `LastHeartbeatAt` vs `LastReportReceivedAt` — the + existing offline tests only advance time after a report, never exercising the + heartbeat-keeps-alive path the design depends on. +- `SiteHealthCollector` — `SetClusterNodes`, `SetInstanceCounts`, `SetParkedMessageCount`, + `SetNodeHostname`, `SetActiveNode`/`NodeRole`, `UpdateTagQuality`, + `UpdateConnectionEndpoint`: not reflected-in-report tested. +- `SiteHealthReportReplica` idempotency under double delivery: untested. + +**Recommendation** + +Add tests for the central report loop (with a fake `IClusterNodeProvider`), the +heartbeat-keeps-online and unknown-site heartbeat paths, and the remaining collector +setters' presence in `CollectReport` output. + +**Resolution** + +_Unresolved._ + +### HealthMonitoring-010 — `HealthReportSender` silently swallows inner failures with bare `catch {}` + +| | | +|--|--| +| Severity | Low | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:70-87` | + +**Description** + +The cluster-nodes update and parked-message-count query are each wrapped in +`try { ... } catch { /* Non-fatal */ }` with no logging. A persistent failure (e.g. +the S&F SQLite store is permanently broken, or `GetClusterNodes()` always throws) +is then completely invisible — every report silently ships with stale cluster nodes +and a parked count of 0, with nothing in the logs to explain the wrong dashboard +values. Bare `catch` with no exception variable also catches `OperationCanceledException` +and would mask shutdown signalling if the awaited call observed the token. + +**Recommendation** + +Catch a specific exception type (or at least `Exception ex`) and `LogWarning`/`LogDebug` +the failure so persistent degradation is diagnosable; avoid swallowing +`OperationCanceledException`. + +**Resolution** + +_Unresolved._ + +### HealthMonitoring-011 — `AddHealthMonitoringActors` is a dead no-op placeholder + +| | | +|--|--| +| Severity | Low | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs:42-46` | + +**Description** + +`AddHealthMonitoringActors` does nothing but `return services` with a "Placeholder for +Phase 4+" comment. A public extension method that silently no-ops is a trap: a caller +who registers it will believe actor wiring is in place. No caller currently invokes it. + +**Recommendation** + +Remove the method until it has real behaviour, or throw `NotImplementedException` so +accidental use fails loudly. If the actor model for this component is genuinely +planned, track it in the design doc instead of a half-method. + +**Resolution** + +_Unresolved._ + +### HealthMonitoring-012 — `SiteHealthState.LatestReport` initialized to `null!`, misrepresenting the contract + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11` | + +**Description** + +`LatestReport` is declared `SiteHealthReport LatestReport { get; set; } = null!;`, +suppressing nullability. Today every code path that creates a `SiteHealthState` (only +`ProcessReport`) assigns `LatestReport`, so it is never actually null — but the +`null!` declaration tells readers and the compiler the opposite of the real +invariant. If HealthMonitoring-007 is addressed by registering state from a heartbeat +(no report yet), this becomes a live `NullReferenceException` risk for UI code that +dereferences `LatestReport`. + +**Recommendation** + +Either make `LatestReport` `required` (matching how it is genuinely always set today) +or make it properly nullable `SiteHealthReport?` and have consumers handle the +"registered, no report yet" case explicitly — consistent with whatever is decided +for HealthMonitoring-007. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/Host/findings.md b/code-reviews/Host/findings.md new file mode 100644 index 0000000..877f5c8 --- /dev/null +++ b/code-reviews/Host/findings.md @@ -0,0 +1,396 @@ +# Code Review — Host + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.Host` | +| Design doc | `docs/requirements/Component-Host.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 11 | + +## Summary + +The Host module is the composition root for the entire ScadaLink system: a single +binary whose behaviour (`Central` vs `Site`) is driven entirely by configuration. The +implementation is generally faithful to `Component-Host.md` — startup validation, +role-based registration, Serilog enrichment, Windows Service support, dead-letter +monitoring, CoordinatedShutdown, and gRPC hosting on site nodes are all present and +backed by a solid test suite (`tests/ScadaLink.Host.Tests`). + +The most significant problem is the readiness endpoint: `/health/ready` runs **all** +registered health checks, including the leader-only `active-node` check, so a fully +operational *standby* central node permanently reports `503` on `/health/ready` — +directly contradicting REQ-HOST-4a, which defines readiness as cluster membership + +DB connectivity (not leadership). Several other findings concern configuration that +is validated-but-never-consumed (`MachineDataDb`), design-doc drift (Akka.Persistence +is required by REQ-HOST-6 but the system uses no persistent actors), an incorrect +seed-node entry in the shipped site config, blocking sync-over-async during startup, +and unguarded string interpolation when building HOCON. None are crash/data-loss +class, but the readiness bug is High because it breaks load-balancer behaviour with +no safe workaround. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | `/health/ready` includes the leader-only check (Host-001); site seed-node config points at the gRPC port (Host-004). | +| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. HOCON built by raw string interpolation (Host-006); `StartAsync` returns before actors are confirmed running (Host-009). | +| 3 | Concurrency & thread safety | ☑ | Blocking `GetAwaiter().GetResult()` on a hosted-service startup thread (Host-005). `DeadLetterMonitorActor` state is actor-confined — no issues. | +| 4 | Error handling & resilience | ☑ | Top-level try/catch logs fatal and rethrows. No retry around DB migration / readiness preconditions (Host-010). | +| 5 | Security | ☑ | Plaintext DB password, LDAP service-account password and dev JWT key checked into `appsettings.Central.json` (Host-003). | +| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. | +| 7 | Design-document adherence | ☑ | REQ-HOST-6 mandates Akka.Persistence config but none exists and no persistent actors exist — doc is stale (Host-002). REQ-HOST-4 GrpcPort-≠-RemotingPort rule not enforced (Host-007). | +| 8 | Code organization & conventions | ☑ | `MachineDataDb` validated/declared but never consumed (Host-008). `LoggingOptions.MinimumLevel` is dead (Host-011). | +| 9 | Testing coverage | ☑ | Strong suite; no test asserts `/health/ready` excludes `active-node`, which is why Host-001 slipped through (noted in Host-001). | +| 10 | Documentation & comments | ☑ | Comments are accurate. REQ-HOST-6 in the design doc is the main stale-doc item (Host-002). | + +## Findings + +### Host-001 — `/health/ready` includes the leader-only `active-node` check + +| | | +|--|--| +| Severity | High | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.Host/Program.cs:135-145` | + +**Description** + +`/health/ready` is mapped with `MapHealthChecks("/health/ready", ...)` and **no +`Predicate`**, so it executes every registered check: `database`, `akka-cluster` +*and* `active-node`. `ActiveNodeHealthCheck` (`Health/ActiveNodeHealthCheck.cs:38`) +returns `Unhealthy` on any node that is not the cluster leader. As a result a +standby central node that is fully operational (cluster member `Up`, database +reachable) still returns `503` on `/health/ready`. This contradicts REQ-HOST-4a, +which defines readiness as cluster membership + DB connectivity + singletons — +explicitly *not* leadership. `/health/active` is the endpoint intended to report +leadership. A load balancer using `/health/ready` to decide whether a node may +serve traffic will permanently treat the standby as unready, defeating failover +readiness. No test covers this: `HealthCheckTests.HealthReady_Endpoint_ReturnsResponse` +only asserts a response is returned, not the standby semantics. + +**Recommendation** + +Add a `Predicate` to the `/health/ready` mapping that excludes the `active-node` +check, e.g. `Predicate = check => check.Name != "active-node"` (or tag the readiness +checks and filter by tag). Add a regression test asserting a non-leader node returns +`200` on `/health/ready`. + +**Resolution** + +_Unresolved._ + +### Host-002 — Akka.Persistence required by REQ-HOST-6 is not configured and not used + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:70-108` | + +**Description** + +REQ-HOST-6 states the Host "must configure the Akka.NET actor system using +Akka.Hosting with ... **Persistence**: Configured with the appropriate journal and +snapshot store (SQL for central, SQLite for site)." The HOCON built in +`AkkaHostedService.StartAsync` contains no `akka.persistence` section, no journal and +no snapshot-store plugin, and `ScadaLink.Host.csproj` references neither +`Akka.Persistence.Hosting` nor any persistence plugin (the design doc Dependencies +list `Akka.Persistence.Hosting`). A repo-wide search finds **no** `PersistentActor` / +`ReceivePersistentActor` subclasses — the system deliberately uses custom SQLite +storage services instead. The code is internally consistent, but the design document +is stale: it mandates a subsystem that does not exist. This is a documented-vs-actual +drift that will mislead future maintainers and any audit against REQ-HOST-6. + +**Recommendation** + +Update `Component-Host.md` REQ-HOST-6 and the Dependencies list to remove the +Akka.Persistence requirement (or explicitly state persistence is provided by +component-owned SQLite storage, not Akka.Persistence). If persistence *is* intended, +add the plugin packages and HOCON. Either way, code and doc must agree. + +**Resolution** + +_Unresolved._ + +### Host-003 — Secrets committed in plaintext in `appsettings.Central.json` + +| | | +|--|--| +| Severity | Medium | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.Host/appsettings.Central.json:20-31` | + +**Description** + +`appsettings.Central.json` contains real-looking secrets in plaintext, checked into +source control: SQL Server passwords in the `ConfigurationDb` / `MachineDataDb` +connection strings (`Password=ScadaLink_Dev1#`), an LDAP service-account password +(`LdapServiceAccountPassword: "password"`), and a JWT signing key +(`JwtSigningKey: "scadalink-dev-jwt-signing-key-..."`). Even though these are +intended as development defaults, shipping them in the default config invites them +being reused verbatim in production, and a committed JWT signing key allows anyone +with repo access to forge session tokens. `TrustServerCertificate=true` additionally +disables TLS validation for the SQL connection. + +**Recommendation** + +Move all secrets out of committed `appsettings*.json` into environment variables, +user-secrets, or a secret store. Keep only non-sensitive structural defaults in the +file and document the required environment variables. At minimum add a clear comment +that these values are dev-only and must be overridden, and rotate the JWT key per +environment. + +**Resolution** + +_Unresolved._ + +### Host-004 — Site seed-node list points at the gRPC port, not a remoting port + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.Host/appsettings.Site.json:10-19` | + +**Description** + +The shipped site config sets `Node:RemotingPort = 8082` and `Node:GrpcPort = 8083`, +but `Cluster:SeedNodes` is `["akka.tcp://scadalink@localhost:8082", +"akka.tcp://scadalink@localhost:8083"]`. The second seed node targets `8083`, which +is the Kestrel HTTP/2 gRPC port — not an Akka remoting endpoint. A node attempting to +join via that seed will try to establish an Akka.Remote TCP association against the +gRPC listener and fail. `StartupValidator` only checks that ≥2 seed nodes exist +(`StartupValidator.cs:54-56`), so this misconfiguration passes validation silently. +For the single-node dev site it is harmless (the first seed succeeds), but it is an +incorrect example that will be copied into multi-node site configs. + +**Recommendation** + +Correct the site seed-node list to reference the two site nodes' *remoting* ports +(e.g. `8082` and `8084`), never the gRPC port. Consider extending `StartupValidator` +to reject a seed node whose port equals this node's `GrpcPort`. + +**Resolution** + +_Unresolved._ + +### Host-005 — Blocking sync-over-async (`GetAwaiter().GetResult()`) inside `StartAsync` + +| | | +|--|--| +| Severity | Low | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:345` | + +**Description** + +`RegisterSiteActors` calls `storeAndForwardService.StartAsync().GetAwaiter().GetResult()` +synchronously, blocking inside the `IHostedService.StartAsync` path. `StartAsync` is +itself declared synchronous (returns `Task.CompletedTask`), so the work cannot be +awaited cleanly. Blocking on async work risks thread-pool starvation during startup +and, if the awaited operation captures a synchronization context, deadlock. It also +hides exceptions behind an `AggregateException` wrapper. + +**Recommendation** + +Make `AkkaHostedService.StartAsync` genuinely `async` and `await +storeAndForwardService.StartAsync(cancellationToken)`. Propagate the +`CancellationToken` and let exceptions surface as the original type. + +**Resolution** + +_Unresolved._ + +### Host-006 — HOCON assembled by unescaped string interpolation + +| | | +|--|--| +| Severity | Low | +| Category | Akka.NET conventions | +| Status | Open | +| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:70-108` | + +**Description** + +The Akka HOCON is built with an interpolated string that injects +`_nodeOptions.NodeHostname`, `_clusterOptions.SeedNodes`, the computed roles, and +`SplitBrainResolverStrategy` directly into the configuration text. Values are not +escaped. A hostname or seed-node string containing a quote, backslash, brace, or +comment sequence would corrupt the HOCON and produce a confusing parse error far from +the real cause; `SplitBrainResolverStrategy` is interpolated without quoting, so a +value with whitespace breaks the document. Building cluster configuration from raw +string concatenation is also harder to maintain than the typed Akka.Hosting builder +the design doc (REQ-HOST-6) actually calls for ("via Akka.Hosting"). + +**Recommendation** + +Prefer the `Akka.Hosting` `AddAkka(...)` builder with strongly-typed `WithRemoting`, +`WithClustering`, and split-brain-resolver configuration instead of hand-built HOCON. +If HOCON must be retained, validate/escape interpolated values (especially hostname +and seed nodes) before substitution. + +**Resolution** + +_Unresolved._ + +### Host-007 — REQ-HOST-4 rule "GrpcPort ≠ RemotingPort" is not enforced + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.Host/StartupValidator.cs:43-47` | + +**Description** + +REQ-HOST-4 requires: "Site nodes must have `GrpcPort` in valid port range (1–65535) +**and different from `RemotingPort`**." `StartupValidator` validates the GrpcPort +range but never compares it to `RemotingPort`. A site config that sets both ports to +the same value passes validation and then fails opaquely at runtime when Kestrel and +Akka.Remote both try to bind the port. The GrpcPort range check is also skipped +entirely when the key is absent (`grpcPortStr != null`), relying on the +`NodeOptions` default of 8083 — acceptable, but the equality rule is the missing +piece. + +**Recommendation** + +Add a check in the `role == "Site"` block: if `GrpcPort` (resolved, including the +8083 default) equals `RemotingPort`, add an error +`"ScadaLink:Node:GrpcPort must differ from RemotingPort"`. + +**Resolution** + +_Unresolved._ + +### Host-008 — `MachineDataDb` is validated and declared but never consumed + +| | | +|--|--| +| Severity | Low | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.Host/StartupValidator.cs:33-34`, `src/ScadaLink.Host/DatabaseOptions.cs:6` | + +**Description** + +`StartupValidator` requires a non-empty `ScadaLink:Database:MachineDataDb` connection +string for Central nodes, and `DatabaseOptions` exposes a `MachineDataDb` property, +but a repo-wide search shows the value is never read anywhere outside the Host module +— only `ConfigurationDb` is passed to `AddConfigurationDatabase` +(`Program.cs:83-85`). The Host therefore fails startup if `MachineDataDb` is missing +even though nothing uses it. This is either dead configuration that should be removed +or a missing wiring (a machine-data DbContext that was never registered). + +**Recommendation** + +Determine whether a machine-data store is actually required. If yes, wire it into the +relevant component's DI registration. If no, remove the `MachineDataDb` validation +rule, the `DatabaseOptions` property, and the key from `appsettings.Central.json`. + +**Resolution** + +_Unresolved._ + +### Host-009 — `StartAsync` reports success before role actors are confirmed running + +| | | +|--|--| +| Severity | Low | +| Category | Akka.NET conventions | +| Status | Open | +| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:127-141` | + +**Description** + +`StartAsync` creates actors with `ActorOf` (a fire-and-forget operation — the actor's +`PreStart` runs asynchronously on its own thread) and then returns +`Task.CompletedTask`. For site nodes, `grpcServer.SetReady(_actorSystem)` is called +synchronously at the end of `RegisterSiteActors`, marking the gRPC server ready even +though `SiteCommunicationActor`, the deployment-manager singleton, and the +`ClusterClient` may not yet have completed their `PreStart`/initial-contact handshake. +REQ-HOST-7 requires "Actor system and SiteStreamManager ... initialized before gRPC +begins accepting connections" — `SiteStreamManager.Initialize` is awaited-equivalent, +but the broader actor graph is not. The window is small and the gRPC server still +rejects streams until `SetReady`, so impact is limited, but readiness is being +asserted optimistically. + +**Recommendation** + +If strict ordering matters, gate `SetReady` on confirmation that +`SiteCommunicationActor` is fully initialized (e.g. an `Ask` round-trip or a +readiness message), or document explicitly that gRPC readiness only guarantees the +actor system exists, not that the cluster handshake has completed. + +**Resolution** + +_Unresolved._ + +### Host-010 — No retry/backoff around startup preconditions (DB migration, readiness) + +| | | +|--|--| +| Severity | Low | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.Host/Program.cs:112-125` | + +**Description** + +On Central startup the Host opens a DI scope and calls +`MigrationHelper.ApplyOrValidateMigrationsAsync` directly. If the SQL Server is not +yet reachable (common in container orchestration where the DB and app start +together), the call throws, the top-level `catch` logs `Fatal`, and the process +exits. There is no bounded retry/backoff to tolerate a database that is briefly +unavailable at boot. The design intent (REQ-HOST-4a, readiness gating, `503` until +ready) is about *serving traffic*, but the migration step happens before the host +even runs and has no such tolerance. + +**Recommendation** + +Wrap the migration/validation step in a bounded retry with exponential backoff (e.g. +Polly), or move schema apply behind the readiness gate so the process stays up and +reports `503` until the database becomes reachable. + +**Resolution** + +_Unresolved._ + +### Host-011 — `LoggingOptions.MinimumLevel` is dead configuration + +| | | +|--|--| +| Severity | Low | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.Host/LoggingOptions.cs:5`, `src/ScadaLink.Host/Program.cs:42-50` | + +**Description** + +`LoggingOptions` exposes a `MinimumLevel` property bound from `ScadaLink:Logging` +(`SiteServiceRegistration.BindSharedOptions`), and both `appsettings.Central.json` +and `appsettings.Site.json` set `"Logging": { "MinimumLevel": "Information" }`. +However Serilog is configured purely via `ReadFrom.Configuration(configuration)`, +which reads the standard `Serilog` section — not `ScadaLink:Logging`. The +`LoggingOptions.MinimumLevel` value is never read by any code, so changing it has no +effect. This is misleading: an operator editing `ScadaLink:Logging:MinimumLevel` +expecting a log-level change will see nothing happen. + +**Recommendation** + +Either consume `LoggingOptions.MinimumLevel` when configuring the Serilog +`LoggerConfiguration` (e.g. set `MinimumLevel.Is(...)` from it), or remove the option +class and the `ScadaLink:Logging` sections and rely solely on the `Serilog` +configuration section. Keep one mechanism, not two. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/InboundAPI/findings.md b/code-reviews/InboundAPI/findings.md new file mode 100644 index 0000000..4d1ccd2 --- /dev/null +++ b/code-reviews/InboundAPI/findings.md @@ -0,0 +1,442 @@ +# Code Review — InboundAPI + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.InboundAPI` | +| Design doc | `docs/requirements/Component-InboundAPI.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 13 | + +## Summary + +The InboundAPI module is small (8 source files) and the happy-path flow — extract +key, validate, deserialize parameters, execute script, serialize result — is clean +and readable. However the review surfaced several real problems concentrated in two +themes: **concurrency** and **security**. The `InboundScriptExecutor` is a singleton +that mutates a plain `Dictionary` from concurrent ASP.NET request threads with no +synchronization, which can corrupt the handler cache or crash the process under load. +On the security side, API-key comparison is a non-constant-time database string +match (timing oracle), compiled scripts run with no enforcement of the documented +script trust model (forbidden APIs such as `System.IO`/`Process`/`Reflection` are +fully reachable), there is no request-body size limit, and the executor's catch-all +swallows `OperationCanceledException` from genuine client disconnects as a "timeout". +Design-doc adherence is also incomplete: the `Database.Connection()` script API +described in the design doc is entirely absent from `InboundScriptContext`, and the +endpoint never enforces that the API is central-only. Testing covers the validators +well but there is no coverage of the HTTP endpoint, concurrency, or recompilation. +None of the findings are data-loss-class, but the concurrency and trust-model issues +are High severity and should be addressed before production use. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | `CoerceValue` returns `null` for legitimately-null/`String` values indistinguishably; parameter-definition edge cases noted. | +| 2 | Akka.NET conventions | ☑ | Module is ASP.NET-hosted, no actors of its own; routes to actors via `CommunicationService`. No correlation-ID issues — IDs are set in `RouteHelper`. | +| 3 | Concurrency & thread safety | ☑ | Singleton `InboundScriptExecutor` mutates a non-thread-safe `Dictionary` from concurrent request threads — see InboundAPI-001/002. | +| 4 | Error handling & resilience | ☑ | Catch-all conflates client cancellation with timeout (InboundAPI-004); compilation-failure path repeats work on every request (InboundAPI-009). | +| 5 | Security | ☑ | Non-constant-time key comparison, no trust-model enforcement, no body-size limit, missing-method enumeration oracle — see InboundAPI-003/005/006/011. | +| 6 | Performance & resource management | ☑ | Up to 3 separate DB round-trips per request in `ApiKeyValidator`; uncapped lazy recompilation. | +| 7 | Design-document adherence | ☑ | `Database.Connection()` script API missing; central-only hosting not enforced; lazy-compile diverges from "compiled at startup". | +| 8 | Code organization & conventions | ☑ | `ParameterDefinition` is an API-shaped POCO declared in the component project rather than Commons; otherwise conventions followed. | +| 9 | Testing coverage | ☑ | Good unit coverage of the two validators; no endpoint, concurrency, recompilation, or timeout-vs-cancel tests. | +| 10 | Documentation & comments | ☑ | `ApiKeyValidationResult.NotFound` XML/name says "NotFound" but returns HTTP 400 — misleading (InboundAPI-013). | + +## Findings + +### InboundAPI-001 — Singleton script handler cache mutated without synchronization + +| | | +|--|--| +| Severity | High | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:17`, `:32`, `:40`, `:89`, `:123-128` | + +**Description** + +`InboundScriptExecutor` is registered as a singleton (`ServiceCollectionExtensions.cs:11`) +and its handler cache is a plain `Dictionary>` (`InboundScriptExecutor.cs:17`). +`RegisterHandler`, `RemoveHandler`, `CompileAndRegister`, and the lazy-compile path in +`ExecuteAsync` all read and write this dictionary with no lock. ASP.NET serves inbound +API requests on concurrent thread-pool threads, so two requests for an as-yet-uncompiled +method (or a request racing a CLI-triggered `CompileAndRegister`) can mutate the +dictionary concurrently. `Dictionary` is explicitly not safe for concurrent +read/write — this can corrupt internal buckets, throw `InvalidOperationException`, +or return a torn/`null` handler, crashing the request or the process. + +**Recommendation** + +Replace the `Dictionary` with a `ConcurrentDictionary>`, or guard all +access with a lock. For the lazy-compile path use `GetOrAdd` so concurrent first-callers +compile at most once. + +**Resolution** + +_Unresolved._ + +### InboundAPI-002 — Lazy compilation is a check-then-act race with no atomicity + +| | | +|--|--| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:123-129` | + +**Description** + +`ExecuteAsync` does `if (!_scriptHandlers.TryGetValue(...)) { CompileAndRegister(method); handler = _scriptHandlers[method.Name]; }`. +Even setting aside the unsynchronized dictionary (InboundAPI-001), this is a +check-then-act sequence: between `TryGetValue` failing and the re-read on line 128, +another thread could `RemoveHandler` the entry, causing the indexer on line 128 to +throw `KeyNotFoundException` — an unhandled-in-context exception that is then caught +only by the broad catch on line 143 and reported to the caller as "Internal script +error". Multiple concurrent first-callers will also each compile the same script +redundantly (wasted Roslyn work). + +**Recommendation** + +Make compile-and-fetch a single atomic operation (`ConcurrentDictionary.GetOrAdd` +with a lazily-evaluated factory, or a per-method lock), and have `CompileAndRegister` +return the handler it produced rather than requiring a separate dictionary read. + +**Resolution** + +_Unresolved._ + +### InboundAPI-003 — API key compared with non-constant-time string equality + +| | | +|--|--| +| Severity | High | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/InboundApiRepository.cs:22-23`, consumed by `src/ScadaLink.InboundAPI/ApiKeyValidator.cs:33` | + +**Description** + +API-key authentication resolves the key with +`FirstOrDefaultAsync(k => k.KeyValue == keyValue)` — an ordinary equality match +translated to a SQL `WHERE KeyValue = @p` comparison. The secret is matched with +ordinary (early-exit) string/SQL comparison rather than a constant-time comparison, +which is a classic timing side-channel for secret material. Combined with the design's +explicit "no rate limiting" decision, an attacker with network access to the central +API can mount a timing attack to recover valid keys. The API key is the *sole* +credential for the inbound API, so this is the primary authentication path. + +**Recommendation** + +Look the key up by a non-secret indexed identifier (e.g. a key prefix/id) or fetch +candidate rows, then verify the secret in-process using +`CryptographicOperations.FixedTimeEquals` over the UTF-8 bytes. Preferably store only +a salted hash of the key value and compare hashes. Avoid leaking secret-length and +match-position timing. + +**Resolution** + +_Unresolved._ + +### InboundAPI-004 — Client disconnect is misreported as a script timeout + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:117-141` | + +**Description** + +`ExecuteAsync` creates a linked CTS from `httpContext.RequestAborted` and the method +timeout, then catches `OperationCanceledException` and unconditionally returns +"Script execution timed out". When the *client* aborts the request (`RequestAborted` +fires), the same exception type is thrown, so a normal client disconnect is logged as +a timeout (`_logger.LogWarning("Script execution timed out ...")`) and an attempt is +made to write a 500 timeout body to an already-gone connection. This pollutes the +failure log (which the design says is reserved for genuine script errors) and obscures +real timeout incidents. + +**Recommendation** + +Distinguish the two cancellation sources: if `cancellationToken` (the request token) +is cancelled, treat it as a client abort — do not log a timeout and do not attempt to +write a response. Only when the timeout CTS fired should the result be "timed out". +Check `cts.Token.IsCancellationRequested && !cancellationToken.IsCancellationRequested` +or use a dedicated timeout `CancellationTokenSource` so the two are separable. + +**Resolution** + +_Unresolved._ + +### InboundAPI-005 — Compiled API scripts run with no script-trust-model enforcement + +| | | +|--|--| +| Severity | High | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:56-93` | + +**Description** + +CLAUDE.md's Akka.NET conventions state the script trust model forbids `System.IO`, +`Process`, `Threading`, `Reflection`, and raw network access. `CompileAndRegister` +compiles arbitrary C# with `CSharpScript.Create` and only restricts the *default +imports* (`WithImports("System", ...)`). Imports are a convenience, not a sandbox — a +script can still fully-qualify any type (`System.IO.File.Delete(...)`, +`System.Diagnostics.Process.Start(...)`, `System.Reflection`, raw `Socket`) because +the core framework assemblies are referenced and Roslyn scripting performs no API +allow/deny-listing. Inbound API scripts execute on the central node with the host +process's privileges, so a malicious or buggy method definition has full host access. +Note the Design role authors these scripts (less trusted than Admin), making +enforcement material. + +**Recommendation** + +Add a compile-time analyzer/`SyntaxWalker` (as the Site Runtime does for instance +scripts) that rejects forbidden namespaces/types before registering a handler, and/or +run scripts under a constrained boundary. At minimum, share the Site Runtime's +forbidden-API checker so the trust model is enforced consistently. Reject the method +(and log) when a violation is found instead of registering it. + +**Resolution** + +_Unresolved._ + +### InboundAPI-006 — No request body size limit on the inbound endpoint + +| | | +|--|--| +| Severity | Medium | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.InboundAPI/EndpointExtensions.cs:54-62` | + +**Description** + +`HandleInboundApiRequest` calls `JsonDocument.ParseAsync(httpContext.Request.Body, ...)` +with no explicit body-size cap and no `[RequestSizeLimit]`/endpoint metadata. Although +Kestrel has a default max request body size, this endpoint accepts arbitrary JSON from +external systems, fully buffers it into a `JsonDocument`, and then `Clone()`s the +root element (`:61`) which materializes the entire document on the heap. With no rate +limiting (a deliberate design choice) a single caller can drive large allocations. +Deep/wide JSON also makes the `CoerceValue` `object`/`list` deserialization +(`ParameterValidator.cs:113,117`) expensive. + +**Recommendation** + +Set an explicit, modest body-size limit on the endpoint +(`.WithMetadata(new RequestSizeLimitAttribute(...))` or +`IHttpMaxRequestBodySizeFeature`) and consider a `JsonDocumentOptions` `MaxDepth`. +Reject oversized bodies with 413 before buffering. + +**Resolution** + +_Unresolved._ + +### InboundAPI-007 — `Database.Connection()` script API from the design doc is not implemented + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:155-170` | + +**Description** + +`Component-InboundAPI.md` ("Script Runtime API -> Database Access") specifies +`Database.Connection("connectionName")` as an available script capability for +querying the configuration/machine-data databases. `InboundScriptContext` exposes only +`Parameters`, `Route`, and `CancellationToken` — there is no `Database` member. Any +method script that follows the documented API will fail to compile. Either the code +is incomplete or the design doc is stale; the two must be reconciled. + +**Recommendation** + +If database access is in scope, add a `Database` property to `InboundScriptContext` +backed by a connection-factory service. If it is not, remove the "Database Access" +section from `Component-InboundAPI.md` so the design doc stops advertising an absent +API. + +**Resolution** + +_Unresolved._ + +### InboundAPI-008 — Inbound API endpoint not restricted to the active central node + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.InboundAPI/EndpointExtensions.cs:19-23`, `src/ScadaLink.Host/Program.cs:149` | + +**Description** + +The design states the Inbound API is "Central cluster only (active node)" and "fails +over with it". `MapInboundAPI` registers `POST /api/{methodName}` unconditionally, and +`Program.cs` maps it inside the central-role branch but with no active-node gating — +unlike `/health/active` which has an `active-node` predicate. A standby central node +will happily serve inbound API calls, executing scripts and `Route.To()` calls from a +non-leader, which can race the active node or run against stale singleton state. + +**Recommendation** + +Gate the endpoint on active-node status (reuse the cluster `active-node` health check +or a leader-state check) and return 503 on the standby, so Traefik/clients only reach +the live node — consistent with how the Management API and `/health/active` are +treated. + +**Resolution** + +_Unresolved._ + +### InboundAPI-009 — Failed compilation is retried on every subsequent request + +| | | +|--|--| +| Severity | Low | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:123-128` | + +**Description** + +When a method's script fails to compile, `CompileAndRegister` returns `false` and +nothing is stored in `_scriptHandlers`. Every subsequent call to that method re-enters +the lazy-compile branch and recompiles the broken script via Roslyn from scratch. +Roslyn compilation is expensive; a single broken method definition repeatedly invoked +by an external caller (no rate limiting) becomes a CPU amplification vector. + +**Recommendation** + +Cache the compilation *failure* (e.g. store a sentinel handler that immediately +returns the compile error, or keep a `HashSet` of known-bad method names with the +diagnostic) so a broken script is compiled at most once until the definition is +updated via `CompileAndRegister`. + +**Resolution** + +_Unresolved._ + +### InboundAPI-010 — `ParameterValidator` ignores extra body fields and cannot validate Object/List element types + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.InboundAPI/ParameterValidator.cs:64-90`, `:112-118` | + +**Description** + +Two related correctness gaps: (1) The validator iterates only over *defined* +parameters; any extra top-level fields in the request body are silently ignored +rather than reported, so callers get no feedback on typo'd parameter names. (2) For +`Object` and `List` types the validator only checks the JSON *kind* (`Object`/`Array`) +and then blindly `JsonSerializer.Deserialize`s the raw text — the design's extended +type system describes Objects as "named structure with typed fields" and Lists as +collections "of objects or primitive types", but no field-level or element-level type +validation is performed. Invalid nested structures pass validation and surface only +as runtime script errors. + +**Recommendation** + +Optionally warn/400 on unexpected body fields. For the extended types, either parse a +richer `ParameterDefinition` (with nested field definitions / element type) and +validate recursively, or document explicitly that Object/List are validated only for +shape — and update the design doc to match. + +**Resolution** + +_Unresolved._ + +### InboundAPI-011 — Method-existence check leaks to unapproved callers (enumeration oracle) + +| | | +|--|--| +| Severity | Low | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.InboundAPI/ApiKeyValidator.cs:39-52` | + +**Description** + +`ValidateAsync` returns 400 `Method '{methodName}' not found` when the method does not +exist, but 403 `API key not approved for this method` when it exists but the key is +not approved. A caller holding any valid enabled key can therefore enumerate which +method names exist on the central API by observing 400-vs-403 responses. The error +message also echoes the caller-supplied `methodName` back verbatim into the JSON +response (`EndpointExtensions.cs:47`), a minor reflected-input concern. + +**Recommendation** + +Return an indistinguishable response (e.g. 403/404) for both "method not found" and +"key not approved" so existence is not observable to unapproved callers. Avoid echoing +raw caller input in error bodies, or sanitize it. + +**Resolution** + +_Unresolved._ + +### InboundAPI-012 — `ParameterDefinition` POCO declared in the component project, not Commons + +| | | +|--|--| +| Severity | Low | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.InboundAPI/ParameterValidator.cs:128-133` | + +**Description** + +`ParameterDefinition` is a persistence-/contract-shaped POCO: it is the deserialized +form of `ApiMethod.ParameterDefinitions` (a column in the configuration database) and +describes the public API contract. CLAUDE.md's code-organization rules place +persistence-ignorant entity/contract types in `ScadaLink.Commons`. Defining it inside +the InboundAPI project means any other component that needs to read or produce method +parameter definitions (e.g. Central UI's method editor, CLI, Management Service) +cannot share the type and will duplicate it. + +**Recommendation** + +Move `ParameterDefinition` (and a matching return-definition type, if added) to +`ScadaLink.Commons` under the InboundApi entity/types namespace so it is shared by all +components that work with method definitions. + +**Resolution** + +_Unresolved._ + +### InboundAPI-013 — `ApiKeyValidationResult.NotFound` factory returns HTTP 400, contradicting its name + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.InboundAPI/ApiKeyValidator.cs:78-79` | + +**Description** + +The static factory is named `NotFound` and is used for the "method not found" case, +but it builds a result with `StatusCode = 400` (Bad Request), not 404. The name +strongly implies 404 and will mislead future maintainers; `EndpointExtensions` +faithfully propagates whatever status code the factory sets, so the misnaming directly +affects the wire contract. + +**Recommendation** + +Rename the factory to match its behaviour (e.g. `BadRequest`) or change the status +code to 404 if that is the intended contract — and document the chosen "method not +found" status in `Component-InboundAPI.md`'s Error Handling section, which currently +does not list it. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/ManagementService/findings.md b/code-reviews/ManagementService/findings.md new file mode 100644 index 0000000..09f10b5 --- /dev/null +++ b/code-reviews/ManagementService/findings.md @@ -0,0 +1,432 @@ +# Code Review — ManagementService + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.ManagementService` | +| Design doc | `docs/requirements/Component-ManagementService.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 13 | + +## Summary + +The ManagementService module is a thin command-dispatch layer: a single `ManagementActor` +fronts every administrative operation, an HTTP `POST /management` endpoint authenticates and +forwards to it, and a SignalR `DebugStreamHub` provides real-time debug streaming. The code +is consistently structured and the role-based authorization gate (`GetRequiredRole`) is +broadly correct and well tested. However, the review surfaced a significant **security +theme**: site-scope enforcement, which the design document requires for instance- and +site-targeted Deployment operations, is applied inconsistently — several query handlers and +all remote-query/debug handlers perform no site-scope check at all, allowing a site-scoped +Deployment user to read or act on sites outside their scope. A second theme is **Akka.NET +convention drift**: the actor offloads all work to `Task.Run` instead of using `PipeTo`, +declares no supervision strategy, and the contract messages carry a loosely-typed `object` +payload. There are also resource-management defects in the HTTP endpoint (`JsonDocument` +instances never disposed) and dead/unused configuration. None of the findings are +crash-class, but the site-scope gaps are High severity because they are a real +authorization bypass with no workaround. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | + | `HandleResolveRoles` builds `RoleMapper` by hand; `ResolveRolesCommand` is a stale dispatch path. See 008, 011. | +| 2 | Akka.NET conventions | + | `Task.Run` instead of `PipeTo`, no supervision strategy, `object`-typed message payload. See 004, 005, 012. | +| 3 | Concurrency & thread safety | + | Actor is stateless so `Task.Run` does not corrupt state, but it defeats actor-thread serialization (004). `Sender` correctly captured to a local before the closure. | +| 4 | Error handling & resilience | + | Exceptions are caught and mapped uniformly; `SiteScopeViolationException` mapped to `Unauthorized`. Audit-logging consistency issue noted in 009. | +| 5 | Security | + | Site-scope enforcement missing on query/remote/debug paths. See 001, 002, 003. | +| 6 | Performance & resource management | + | `JsonDocument` instances never disposed in the HTTP endpoint. See 006. | +| 7 | Design-document adherence | + | Design doc states remote queries enforce site scoping; code does not. `ManagementServiceOptions` reserved-for-future config is unused. See 001, 010. | +| 8 | Code organization & conventions | + | Mixed serializers (Newtonsoft in actor, System.Text.Json in endpoint); inconsistent audit logging across mutations. See 007, 009. | +| 9 | Testing coverage | + | Authorization is well covered; site-scope enforcement, the HTTP endpoint, `DebugStreamHub`, and remote-query handlers have no tests. See 013. | +| 10 | Documentation & comments | + | XML docs are accurate where present; `ManagementServiceOptions` and `ResolveRolesCommand` paths are undocumented dead code (010, 011). | + +## Findings + +### ManagementService-001 — Remote-query and debug-snapshot handlers bypass site-scope enforcement + +| | | +|--|--| +| Severity | High | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:1465`, `:1481`, `:1493`, `:641`, `:649` | + +**Description** + +The design document (`Component-ManagementService.md`, Authorization section) states that for +Deployment users "Site scoping is enforced for site-scoped Deployment users" and lists +"debug snapshot, parked message queries, site event log queries" among the Deployment-role +operations. `HandleQueryEventLogs`, `HandleQueryParkedMessages`, `HandleDebugSnapshot`, +`HandleRetryParkedMessage`, and `HandleDiscardParkedMessage` make no call to `EnforceSiteScope` +or `EnforceSiteScopeForInstance`. A Deployment user scoped to site A can therefore query event +logs / parked messages of site B, retry or discard another site's parked messages, and pull a +debug snapshot of any instance simply by supplying a different `SiteIdentifier` or `InstanceId`. +This is an authorization bypass with no workaround. + +**Recommendation** + +In each of these handlers resolve the target site and call site-scope enforcement before +delegating to `CommunicationService`. For the `SiteIdentifier`-keyed handlers, look up the +`Site` by identifier and enforce against `Site.Id`; for `DebugSnapshotCommand` the instance +is already loaded — call `EnforceSiteScope(user, instance.SiteId)` (which requires threading +`AuthenticatedUser` into these handlers, currently dropped). + +**Resolution** + +_Unresolved._ + +### ManagementService-002 — Single-entity query handlers leak data across site scope + +| | | +|--|--| +| Severity | High | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:510`, `:673`, `:733`, `:774`, `:631`, `:624` | + +**Description** + +`HandleListInstances` and `HandleListSites` correctly filter their results by the user's +`PermittedSiteIds`, but the single-entity query handlers do not. `HandleGetInstance`, +`HandleGetSite`, `HandleListAreas`, and `HandleGetDataConnection` fetch by ID with no +site-scope check, so a site-scoped Deployment user can read any instance, site, area tree, +or data connection by ID even though that site is excluded from their scope. The list +endpoints having a filter while the get-by-id endpoints do not is an inconsistency that +undermines the scoping model. (`HandleGetDeploymentDiff` and `HandleListInstanceAlarmOverrides` +do enforce scope, confirming the omission elsewhere is unintentional.) + +**Recommendation** + +Apply `EnforceSiteScopeForInstance` in `HandleGetInstance`, and `EnforceSiteScope` against +the resolved site ID in `HandleGetSite`, `HandleListAreas`, and `HandleGetDataConnection` +(for data connections, scope by the connection's `SiteId`). + +**Resolution** + +_Unresolved._ + +### ManagementService-003 — DebugStreamHub.SubscribeInstance performs no per-instance authorization + +| | | +|--|--| +| Severity | High | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.ManagementService/DebugStreamHub.cs:104` | + +**Description** + +`OnConnectedAsync` authenticates the WebSocket connection and verifies the caller holds the +`Deployment` role, but `SubscribeInstance(int instanceId)` accepts any instance ID and starts +a stream without checking that the authenticated user is scoped to that instance's site. A +site-scoped Deployment user can therefore subscribe to the live debug stream (attribute +values, alarm states) of an instance belonging to a site outside their scope. This is the +streaming equivalent of finding 001/002. + +**Recommendation** + +Resolve the instance's site inside `SubscribeInstance` and reject the subscription if the +authenticated user's permitted-site set does not include it. The authenticated identity +established in `OnConnectedAsync` must be persisted on the connection (e.g. in +`Context.Items`) so it is available to `SubscribeInstance`. + +**Resolution** + +_Unresolved._ + +### ManagementService-004 — Actor offloads work to Task.Run instead of using PipeTo + +| | | +|--|--| +| Severity | Medium | +| Category | Akka.NET conventions | +| Status | Open | +| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:61` | + +**Description** + +`HandleEnvelope` runs every command on a thread-pool thread via `Task.Run(async () => ...)` +and replies from inside the continuation. This is the anti-pattern the project's Akka.NET +conventions warn against — the canonical approach is to start the async work and `PipeTo` +its result back to `Self`/`Sender`. Although `Sender` is correctly copied to a local before +the closure, the current code: (a) lets multiple commands execute fully concurrently with no +actor-thread serialization, so the actor provides no ordering or back-pressure guarantees +and is an actor in name only; (b) cannot be paused, supervised, or made to honour a mailbox +bound; (c) is shielded from synchronous faults only because every path is inside the +try/catch — any future code path that throws synchronously before the `Task.Run` body would +escape it. + +**Recommendation** + +Replace `Task.Run` with a method that returns the `Task` and `PipeTo` the mapped result +(`ManagementSuccess`/`ManagementError`/`ManagementUnauthorized`) back to the captured sender, +mapping faults in the `PipeTo` failure continuation. If genuine parallelism is desired, make +that explicit with a router/dispatcher rather than ad-hoc `Task.Run`. + +**Resolution** + +_Unresolved._ + +### ManagementService-005 — ManagementActor declares no supervision strategy + +| | | +|--|--| +| Severity | Low | +| Category | Akka.NET conventions | +| Status | Open | +| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:33` | + +**Description** + +The project conventions call for explicit supervision strategies (Resume for coordinator +actors). `ManagementActor` is a long-lived coordinator-style actor but overrides no +`SupervisorStrategy` and defines no `PreRestart`/`PostRestart` behaviour. In practice it +spawns no children so the default strategy is rarely exercised, but an explicit strategy +should still be declared for clarity and to match the documented convention; it also matters +if children are added later (e.g. if finding 004 introduces worker actors). + +**Recommendation** + +Add an explicit `protected override SupervisorStrategy SupervisorStrategy()` returning a +Resume-based strategy, consistent with other central coordinator actors. + +**Resolution** + +_Unresolved._ + +### ManagementService-006 — JsonDocument instances never disposed in the HTTP endpoint + +| | | +|--|--| +| Severity | Medium | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.ManagementService/ManagementEndpoints.cs:83`, `:112` | + +**Description** + +`JsonDocument` is `IDisposable` (it rents buffers from a pooled `ArrayPool`). `HandleRequest` +parses the request body into `doc` at line 83 and never disposes it, and line 112 +(`JsonDocument.Parse("{}")`) allocates a second document inline that is also never disposed. +Every management HTTP call therefore leaks pooled buffers, increasing GC pressure and pool +churn under load. + +**Recommendation** + +Wrap the parsed document in `using var doc = ...`. For the empty-payload fallback, avoid +allocating a `JsonDocument` entirely — deserialize from the literal string `"{}"`/an empty +object, or restructure so the fallback path does not parse a throwaway document. + +**Resolution** + +_Unresolved._ + +### ManagementService-007 — Inconsistent and cycle-prone serialization of repository entities + +| | | +|--|--| +| Severity | Medium | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:67`; `src/ScadaLink.ManagementService/ManagementEndpoints.cs:113` | + +**Description** + +The actor serializes every command result with `Newtonsoft.Json` (`JsonConvert.SerializeObject`) +while the HTTP endpoint deserializes payloads with `System.Text.Json`. Beyond the +inconsistency, `JsonConvert.SerializeObject` is applied directly to EF-backed entities +returned by repositories (e.g. `Site`, `DataConnection`, `NotificationList` with a +`Recipients` collection, `Template` with children). With default Newtonsoft settings any +bidirectional navigation property produces a `JsonSerializationException` for self-referencing +loops, and even without cycles this serializes lazy/navigation state the CLI does not expect. + +**Recommendation** + +Standardise on one serializer (the rest of the HTTP path uses `System.Text.Json`). Serialize +explicit DTOs / projections rather than EF entities, or configure +`ReferenceLoopHandling.Ignore` and ignore navigation properties. Verify that handlers +returning rich entity graphs (`HandleGetTemplate`, `HandleUpdateNotificationList`) round-trip +correctly. + +**Resolution** + +_Unresolved._ + +### ManagementService-008 — HandleResolveRoles constructs RoleMapper manually instead of via DI + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:285` | + +**Description** + +Every other handler resolves its collaborators from the scoped `IServiceProvider`. +`HandleResolveRoles` instead does `new RoleMapper(sp.GetRequiredService())`, +bypassing DI. If `RoleMapper` ever gains a dependency, caching, or options, this hand-built +instance silently diverges from the DI-registered one. It is also inconsistent with +`ManagementEndpoints`, which resolves `RoleMapper` from DI. + +**Recommendation** + +Resolve `RoleMapper` via `sp.GetRequiredService()` like every other dependency. + +**Resolution** + +_Unresolved._ + +### ManagementService-009 — Audit logging applied inconsistently across mutating handlers + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:357`, `:1134`, `:1085`, `:526`, `:1275` | + +**Description** + +The design doc states "All mutating operations are audit logged." Some handlers call +`AuditAsync` explicitly (`HandleCreateInstance`, `HandleCreateSite`, all repository-direct +external-system/notification/security/area mutations), but the handlers that delegate to a +domain service do **not** — `HandleCreateTemplate`/`HandleUpdateTemplate`/`HandleDeleteTemplate`, +all template-member handlers (`HandleAddAttribute` ... `HandleDeleteComposition`), template-folder +handlers, shared-script handlers, `HandleDeployArtifacts`, `HandleDeployInstance`, +`HandleEnableInstance`/`Disable`/`Delete`, and the instance-binding/override handlers. This is +correct only if every one of those services performs its own audit logging internally; the +mixed pattern makes that impossible to verify by reading this module and creates a real risk +of silent audit gaps for template authoring and deployment operations. + +**Recommendation** + +Decide on one layer that owns auditing. Either route all mutations through services that audit +internally (and remove the explicit `AuditAsync` calls here), or audit uniformly in the actor +after every successful mutation. Document the chosen contract so the inconsistency cannot +recur, and confirm template/deployment services actually audit. + +**Resolution** + +_Unresolved._ + +### ManagementService-010 — ManagementServiceOptions.CommandTimeout is defined but never used + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.ManagementService/ManagementServiceOptions.cs:5`; `src/ScadaLink.ManagementService/ManagementEndpoints.cs:16` | + +**Description** + +`ManagementServiceOptions.CommandTimeout` is bound from configuration in +`ServiceCollectionExtensions`, but no code reads it. The HTTP endpoint instead hard-codes +`AskTimeout = TimeSpan.FromSeconds(30)`. The design doc describes the options section as +"Reserved for future configuration — e.g., command timeout overrides", yet a concrete +`CommandTimeout` property already exists and is silently ignored, so an operator who sets it +in `appsettings.json` gets no effect. + +**Recommendation** + +Either consume `ManagementServiceOptions.CommandTimeout` in `ManagementEndpoints.HandleRequest` +(inject `IOptions`), or remove the property until it is wired up so +configuration cannot be set with no effect. + +**Resolution** + +_Unresolved._ + +### ManagementService-011 — ResolveRolesCommand dispatch path is stale dead code + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:273`, `:283` | + +**Description** + +The design doc states the HTTP endpoint "collapses the CLI's previous two-step flow +(ResolveRoles + actual command) into a single HTTP round-trip", and indeed `ManagementEndpoints` +performs LDAP auth and role resolution itself before dispatching. The `ResolveRolesCommand` +case in `DispatchCommand` is therefore unreachable from the HTTP path. It remains reachable +only via a raw ClusterClient sender, but a caller able to send `ResolveRolesCommand` could +enumerate role mappings for arbitrary LDAP groups with no role requirement +(`GetRequiredRole` returns null for it) — a minor information-disclosure surface for a path +the design says no longer exists. + +**Recommendation** + +If the two-step flow is genuinely retired, remove `ResolveRolesCommand`, its handler, and the +class. If it must remain for non-HTTP clients, document why and confirm exposing role-mapping +data unauthenticated is intended. + +**Resolution** + +_Unresolved._ + +### ManagementService-012 — ManagementEnvelope carries a loosely-typed object payload + +| | | +|--|--| +| Severity | Low | +| Category | Akka.NET conventions | +| Status | Open | +| Location | `src/ScadaLink.Commons/Messages/Management/ManagementEnvelope.cs:7`; `src/ScadaLink.ManagementService/ManagementActor.cs:132` | + +**Description** + +`ManagementEnvelope.Command` is typed `object`, so the actor relies on a large open-ended +`switch` with a `NotSupportedException` default for unknown types. While the individual +command records are immutable, `object` defeats compile-time exhaustiveness — adding a new +command record produces no compiler signal that `DispatchCommand` (and `GetRequiredRole`) +need updating, and a typo or unregistered command surfaces only as a runtime exception. The +message contract is also harder to evolve safely under the additive-only rule. + +**Recommendation** + +Introduce a marker interface (e.g. `IManagementCommand`) implemented by every command record +and type the envelope payload as that interface. This documents the contract, lets analyzers +flag unhandled cases, and keeps `ManagementCommandRegistry`'s reflection scan precise. + +**Resolution** + +_Unresolved._ + +### ManagementService-013 — No tests for site-scope enforcement, the HTTP endpoint, or DebugStreamHub + +| | | +|--|--| +| Severity | Medium | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.ManagementService.Tests/ManagementActorTests.cs:1` | + +**Description** + +`ManagementActorTests` covers role-based authorization, success/error mapping, and correlation +IDs thoroughly, but several critical paths are untested: (a) site-scope enforcement — +`EnforceSiteScope`/`EnforceSiteScopeForInstance` and `SiteScopeViolationException` -> `Unauthorized` +mapping have no test, which is why the gaps in findings 001/002 went unnoticed; (b) +`ManagementEndpoints` — Basic Auth decoding, malformed-header handling, LDAP/role resolution, +command deserialization, and HTTP status mapping have zero coverage; (c) `DebugStreamHub` +authentication, subscribe/unsubscribe lifecycle, and `ManagementCommandRegistry.Resolve` are +untested. The `Envelope` test helper always passes `Array.Empty()` for permitted +sites, so no test ever exercises a site-scoped user. + +**Recommendation** + +Add tests that exercise a site-scoped Deployment user against in-scope and out-of-scope +targets for instance and site operations, asserting `ManagementUnauthorized` on violations. +Add `WebApplicationFactory`-based tests for `ManagementEndpoints` covering auth failures, +malformed bodies, unknown commands, and the 200/400/403/401/504 mappings. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/NotificationService/findings.md b/code-reviews/NotificationService/findings.md new file mode 100644 index 0000000..821c5ad --- /dev/null +++ b/code-reviews/NotificationService/findings.md @@ -0,0 +1,306 @@ +# Code Review — NotificationService + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.NotificationService` | +| Design doc | `docs/requirements/Component-NotificationService.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 12 | + +## Summary + +The NotificationService module is small (6 source files) and structurally clean: it +abstracts the SMTP client behind an interface, isolates the OAuth2 token lifecycle, +and integrates with the Store-and-Forward Engine for transient-failure buffering. +However, the review surfaced several substantive defects. The most serious is that +**no Store-and-Forward delivery handler is ever registered for the `Notification` +category** — buffered notifications are persisted but never retried or delivered, +silently losing every notification that hit a transient SMTP failure. Error +classification is fragile (substring matching on exception messages) and is +applied inconsistently between `SendAsync` and `DeliverAsync`. `DeliverAsync` also +contains a resource-management bug that constructs and leaks two SMTP clients per +call. Secondary themes: the `OAuth2TokenService` singleton caches a single token +keyed to no credential identity (incorrect if multiple SMTP configs exist), several +design-doc requirements are unimplemented (connection timeout, max concurrent +connections, TLS `SSL`/`None` modes), and credentials are stored and passed as +plaintext `string` values. Test coverage exercises the happy path and the main +error branches but misses the OAuth2 delivery path, the permanent-classification +fallback in `DeliverAsync`, and concurrency on the token cache. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | Double SMTP client construction; `Auto` socket option for non-TLS; `TimeoutException`/`OperationCanceledException` misclassified. | +| 2 | Akka.NET conventions | ☑ | No actors in this module (`AddNotificationServiceActors` is a no-op); delivery is a plain DI service. No Akka-specific issues. | +| 3 | Concurrency & thread safety | ☑ | `OAuth2TokenService` is a singleton with a shared mutable token cache; double-checked locking present but cache key is wrong (NS-006). | +| 4 | Error handling & resilience | ☑ | Critical: no S&F delivery handler registered for `Notification` (NS-001). Fragile substring error classification (NS-002, NS-003). | +| 5 | Security | ☑ | Credentials handled as plaintext strings; OAuth2 client secret in DB credential blob; no recipient address validation. | +| 6 | Performance & resource management | ☑ | Two `ISmtpClientWrapper` instances created per send, one leaked; connection not pooled; `MaxConcurrentConnections` unenforced. | +| 7 | Design-document adherence | ☑ | Connection timeout, max concurrent connections, and TLS `SSL`/`None` modes from the design doc are not implemented. | +| 8 | Code organization & conventions | ☑ | `SmtpPermanentException` in the wrong file; `SmtpConfiguration` POCO has non-nullable strings with no initializer (compiler-warning risk). | +| 9 | Testing coverage | ☑ | Happy path and main error branches covered; OAuth2 delivery path, `DeliverAsync` permanent fallback, and token-cache concurrency untested. | +| 10 | Documentation & comments | ☑ | XML comment on `DeliverAsync` ("Throws on failure") and the misleading "OAuth2 token refresh if needed" comment do not match behaviour. | + +## Findings + +### NotificationService-001 — Buffered notifications are never retried (no S&F delivery handler) + +| | | +|--|--| +| Severity | Critical | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:96`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:8` | + +**Description** + +On a transient SMTP failure the service calls `_storeAndForward.EnqueueAsync(StoreAndForwardCategory.Notification, ...)`. The Store-and-Forward Engine only delivers (immediately or on retry sweep) a category for which a delivery handler has been registered via `StoreAndForwardService.RegisterDeliveryHandler`. A repo-wide search shows the `Notification` category handler is never registered anywhere — `StoreAndForwardCategory.Notification` appears only in this module's `EnqueueAsync` call. As a result, every buffered notification falls into the `RetryMessageAsync` "No delivery handler for category" branch (`StoreAndForwardService.cs:201-204`), which logs a warning and returns without ever delivering or removing the message. Buffered notifications accumulate in SQLite forever and are never sent. This silently loses every notification that hit a transient failure, while `SendAsync` returns `Success=true, WasBuffered=true`, telling the caller the notification is safely queued. This directly violates the design doc's "integrates with the Store-and-Forward Engine for reliable delivery" guarantee. + +**Recommendation** + +Register a delivery handler for `StoreAndForwardCategory.Notification` during startup that deserializes the buffered payload (`ListName`, `Subject`, `Message`), re-resolves the list/recipients/SMTP config, and re-attempts `DeliverAsync`, returning `true` on success, `false` on permanent failure, and throwing on transient failure. Wire it in `AddNotificationService` or the host bootstrap. Add an integration test covering the buffer-then-retry-then-deliver round trip. + +**Resolution** + +_Unresolved._ + +### NotificationService-002 — `TimeoutException`/`OperationCanceledException` misclassified as transient + +| | | +|--|--| +| Severity | High | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:157-167` | + +**Description** + +`IsTransientSmtpError` treats `OperationCanceledException` (and its subtype `TaskCanceledException`) as a transient SMTP error. When the caller passes a `CancellationToken` that is cancelled — e.g. the Script Execution Actor is stopped, or the script times out — the resulting `OperationCanceledException` is caught by the `catch ... when (IsTransientSmtpError(ex))` clause and the notification is buffered as if SMTP had failed. A deliberate cancellation should propagate, not be silently buffered for retry. The same clause classifies any `IOException` as transient even though `IOException` covers unrelated failures (e.g. a serialization stream error). Additionally, `OperationCanceledException` raised by token cancellation in the OAuth2 path would be miscategorised the same way. + +**Recommendation** + +Re-throw `OperationCanceledException`/`TaskCanceledException` when `cancellationToken.IsCancellationRequested` is true rather than classifying it as transient. Narrow `IOException` handling to SMTP-specific I/O failures, or rely on MailKit's typed exceptions (`SmtpCommandException`, `SmtpProtocolException`, `ServiceNotConnectedException`) instead of broad base types. + +**Resolution** + +_Unresolved._ + +### NotificationService-003 — Error classification by substring matching on exception messages is fragile + +| | | +|--|--| +| Severity | High | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:144-147`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:163-166` | + +**Description** + +Transient/permanent classification depends on `ex.Message.Contains("5.")`, `Contains("4.")`, `Contains("550")`, `Contains("421")`, etc. This is unreliable: (a) `Message.Contains("5.")` matches any message containing the literal "5." anywhere — e.g. a host name `smtp5.example.com`, a version string, or a path — producing false permanent classification; (b) `Contains("4.")` likewise matches `"v4.0"` or an IP address octet; (c) MailKit exposes the actual SMTP status code on `SmtpCommandException.StatusCode`, which is the correct, locale-independent source of truth and is being ignored; (d) message text is culture/version-dependent and not part of any stable contract. Misclassification has real consequences: a permanent failure misread as transient floods the S&F buffer (which the design doc explicitly says must be prevented), and a transient failure misread as permanent loses the notification. + +**Recommendation** + +Classify on MailKit's typed exceptions and `SmtpCommandException.StatusCode` (4xx → transient, 5xx → permanent), and `SocketException`/`SmtpProtocolException`/connection-refused → transient. Remove all `Message.Contains` checks. + +**Resolution** + +_Unresolved._ + +### NotificationService-004 — `DeliverAsync` constructs two SMTP clients and leaks the used one + +| | | +|--|--| +| Severity | High | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:118-119` | + +**Description** + +```csharp +using var client = _smtpClientFactory() as IDisposable; +var smtp = _smtpClientFactory(); +``` + +The factory is invoked twice, creating two separate `MailKitSmtpClientWrapper` instances (each owning a real `SmtpClient` with a socket). The first instance is assigned to `client` and disposed by the `using`, but it is never used. The second instance, `smtp`, is the one actually connected, authenticated, used to send, and `DisconnectAsync`'d — but it is never `Dispose`d. `MailKitSmtpClientWrapper` implements `IDisposable` and wraps an unmanaged socket; the connected client is leaked on every send. `DisconnectAsync` closes the connection but does not dispose the `SmtpClient`. Over time this leaks sockets/handles. + +**Recommendation** + +Create exactly one client and dispose the one that is actually used: +`using var smtp = _smtpClientFactory();` then cast to `IDisposable` only if needed (the factory's `Func` should ideally return a type that the `using` can dispose directly — consider having `ISmtpClientWrapper` extend `IAsyncDisposable`/`IDisposable`). + +**Resolution** + +_Unresolved._ + +### NotificationService-005 — Non-TLS path uses `SecureSocketOptions.Auto`, contradicting the requested mode + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:18`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:123` | + +**Description** + +`ConnectAsync` maps `useTls` to either `SecureSocketOptions.StartTls` or `SecureSocketOptions.Auto`. `useTls` is computed in `DeliverAsync` as `TlsMode == "starttls"`. So a configuration of `TlsMode = "none"` produces `useTls = false` → `SecureSocketOptions.Auto`, which lets MailKit opportunistically negotiate TLS — the opposite of "None". Worse, the design doc defines three TLS modes — `None`, `StartTLS`, `SSL` — but the code collapses them to a single boolean, so `SSL` (implicit TLS, typically port 465) is treated identically to `None`/`Auto` and the SSL mode is effectively unsupported. The `bool useTls` parameter cannot represent the three-state requirement. + +**Recommendation** + +Pass the `TlsMode` string (or a `TlsMode` enum) through to the wrapper and map explicitly: `None` → `SecureSocketOptions.None`, `StartTLS` → `SecureSocketOptions.StartTls`, `SSL` → `SecureSocketOptions.SslOnConnect`. Validate the configured value and reject unknown modes. + +**Resolution** + +_Unresolved._ + +### NotificationService-006 — OAuth2 token cache is keyed to nothing; wrong token returned when multiple SMTP configs exist + +| | | +|--|--| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.NotificationService/OAuth2TokenService.cs:14-15`, `src/ScadaLink.NotificationService/OAuth2TokenService.cs:30-35` | + +**Description** + +`OAuth2TokenService` is registered as a singleton and stores a single `_cachedToken`/`_tokenExpiry` pair. `GetTokenAsync` ignores the `credentials` argument when deciding whether the cache is valid — it only checks expiry. If two SMTP configurations with different tenant/client credentials are ever used (the repository's `GetAllSmtpConfigurationsAsync` returns a list, implying multiple configs are possible), the second caller receives the first caller's token, which will fail authentication against the second tenant. Even with a single config today this is a latent correctness bug and makes the service's behaviour depend on call order. + +**Recommendation** + +Key the cache by the credential identity (e.g. a dictionary keyed by `tenantId:clientId`, or by a hash of the credential string), or document and enforce the single-SMTP-config invariant. Given the design doc says one SMTP config is deployed per site, enforcing the invariant is acceptable but should be explicit. + +**Resolution** + +_Unresolved._ + +### NotificationService-007 — Connection timeout and max-concurrent-connections from the design doc are not implemented + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.NotificationService/NotificationOptions.cs:11-14`, `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:16-20`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:111-140` | + +**Description** + +The design doc specifies an SMTP "Connection timeout (default 30s)" and "Max concurrent connections (default 5)", and `NotificationOptions`/`SmtpConfiguration` both carry these fields. Neither is enforced: `MailKitSmtpClientWrapper.ConnectAsync` never sets `SmtpClient.Timeout`, so the connection relies on MailKit's default timeout rather than the configured value (only the caller's `CancellationToken` bounds it, and callers may pass `default`). There is no semaphore or other throttle limiting concurrent SMTP connections per site, so `MaxConcurrentConnections` has no effect. Both options exist but are dead configuration. + +**Recommendation** + +Set `SmtpClient.Timeout` from `ConnectionTimeoutSeconds` in `ConnectAsync` (and/or derive a linked `CancellationTokenSource`). Introduce a `SemaphoreSlim(MaxConcurrentConnections)` gating `DeliverAsync`. If these limits are intentionally deferred, mark the options `[Obsolete]`/document them as not-yet-enforced and note the gap in the design doc. + +**Resolution** + +_Unresolved._ + +### NotificationService-008 — Recipient email addresses are not validated before send + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:136-137`, `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:50-53` | + +**Description** + +`SendAsync` builds `bccAddresses` directly from `recipient.EmailAddress` and passes them to `MailboxAddress.Parse`. If any recipient row has a malformed address, `MailboxAddress.Parse` throws `ParseException`. `ParseException` is not a `TimeoutException`/`SocketException`/`IOException` and its message will not generally contain "4." or "5.", so it falls through `DeliverAsync`'s outer `catch ... when (... && !IsTransientSmtpError(ex))` filter, which re-throws it (`:153`); it then escapes `SendAsync` entirely as an unhandled exception (the `SendAsync` catch blocks only cover `SmtpPermanentException` and transient errors). A single bad address in a list therefore crashes the send with an exception type the calling script is not told to expect, instead of producing a clean `NotificationResult` error. The same applies to a malformed `FromAddress`. + +**Recommendation** + +Validate addresses up front (e.g. `MailboxAddress.TryParse`) and return a `NotificationResult(false, ...)` listing invalid recipients, or wrap `DeliverAsync` so any non-classified exception becomes a permanent `NotificationResult` failure rather than escaping. Consider validating addresses at definition time in the Central UI as well. + +**Resolution** + +_Unresolved._ + +### NotificationService-009 — Credentials handled as plaintext strings; OAuth2 client secret logged risk + +| | | +|--|--| +| Severity | Medium | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:127-134`, `src/ScadaLink.NotificationService/OAuth2TokenService.cs:30-65`, `src/ScadaLink.Commons/Entities/Notifications/SmtpConfiguration.cs:9` | + +**Description** + +SMTP credentials — Basic Auth `user:pass` and OAuth2 `tenantId:clientId:clientSecret` — are stored and passed as a single colon-delimited plaintext `string` (`SmtpConfiguration.Credentials`). There is no indication the value is encrypted at rest in SQLite or in the central config DB. The colon-delimited packing is also brittle: a password or client secret containing a `:` will be split incorrectly (`Split(':', 2)` / `Split(':', 3)`), silently corrupting the secret. Separately, while the current code does not log the secret directly, the substring-based error classification logs full exception messages (`_logger.LogWarning(ex, ...)`, `LogError(ex, ...)`) and MailKit exceptions can echo back server responses; an authentication failure message could surface credential fragments into logs. There is no defensive scrubbing. + +**Recommendation** + +Store credentials encrypted at rest (DPAPI/Data Protection or a secret store) and model them as structured fields rather than a colon-packed string, so secrets containing `:` are safe. Ensure credential values are never written to logs; consider a redaction step on exception messages before logging. + +**Resolution** + +_Unresolved._ + +### NotificationService-010 — `DeliverAsync` does not disconnect the SMTP client on failure + +| | | +|--|--| +| Severity | Low | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:121-154` | + +**Description** + +`DisconnectAsync` is only called at `:139`, on the success path inside the `try` block. If `AuthenticateAsync` or `SendAsync` throws, control jumps to the `catch` filter at `:141` and the method exits (re-throwing or wrapping) without ever calling `DisconnectAsync`. Combined with NS-004 (the client is never disposed either), a failed send leaves an open, authenticated SMTP connection until the socket is eventually reclaimed by finalization. Under sustained transient failures this can exhaust the SMTP server's connection slots. + +**Recommendation** + +Move disconnect/dispose into a `finally` block (or use `await using` once `ISmtpClientWrapper` supports `IAsyncDisposable`) so the connection is always torn down regardless of outcome. + +**Resolution** + +_Unresolved._ + +### NotificationService-011 — `SmtpPermanentException` declared in the wrong file; module conventions + +| | | +|--|--| +| Severity | Low | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:173-177`, `src/ScadaLink.Commons/Entities/Notifications/SmtpConfiguration.cs:5-15` | + +**Description** + +Two minor convention issues. (1) `SmtpPermanentException` is a public exception type declared at the bottom of `NotificationDeliveryService.cs` rather than in its own file (`SmtpPermanentException.cs`), which is inconsistent with the one-type-per-file layout used elsewhere and makes it harder to locate. (2) `SmtpConfiguration` (a Commons POCO) declares non-nullable `string` properties (`Host`, `AuthType`, `FromAddress`) that are only guaranteed by the constructor; EF Core materialization or object-initializer use can leave them null while the type system says otherwise. These are persistence-ignorant POCO concerns but worth flagging because the delivery service dereferences `config.Host`, `config.AuthType`, `config.FromAddress` without null checks. + +**Recommendation** + +Move `SmtpPermanentException` to its own file. For `SmtpConfiguration`, either keep the constructor as the only path and document it, or use `required` members so the compiler enforces initialization. + +**Resolution** + +_Unresolved._ + +### NotificationService-012 — Test coverage gaps: OAuth2 delivery path, permanent-classification fallback, token-cache concurrency + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.NotificationService.Tests/NotificationDeliveryServiceTests.cs`, `tests/ScadaLink.NotificationService.Tests/OAuth2TokenServiceTests.cs` | + +**Description** + +The tests cover the happy path, list-not-found, no-recipients, no-SMTP-config, permanent failure, transient-without-S&F, and transient-with-S&F buffering. Notable untested paths: (1) the OAuth2 delivery branch in `DeliverAsync:128-132` — every test uses `tokenService: null` and Basic Auth, so OAuth2 token resolution during a send is never exercised; (2) `DeliverAsync`'s permanent-classification fallback (`:144-149`) that promotes a generic exception whose message contains "550"/"553"/"554" to `SmtpPermanentException` is never tested; (3) `OAuth2TokenServiceTests` never tests concurrent `GetTokenAsync` calls (the double-checked-locking path) or token expiry/refresh — the cache test uses a 3600s token so refresh never triggers; (4) no test covers the transient-with-S&F path actually delivering after retry (which would also have caught NS-001). Given NS-001 is a critical defect, the absence of an end-to-end buffer-and-retry test is significant. + +**Recommendation** + +Add tests for: OAuth2-authenticated send with a mocked `OAuth2TokenService`; the `DeliverAsync` 5xx-message permanent fallback; token expiry/refresh (short `expires_in`); concurrent token acquisition; and an end-to-end buffered-notification retry once a `Notification` S&F handler is registered. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/README.md b/code-reviews/README.md new file mode 100644 index 0000000..cb5e42a --- /dev/null +++ b/code-reviews/README.md @@ -0,0 +1,330 @@ +# Code Reviews + +Comprehensive, per-module code reviews of the ScadaLink codebase. Each module (one +buildable project under `src/`) has its own folder containing a `findings.md`. This +README is the aggregated index — the single place to see all outstanding work. + +## How it works + +- Reviews are performed one module at a time against a fixed checklist. +- Every finding is recorded in the module's `findings.md` with a severity and status. +- Findings are **never deleted** — they are closed by changing their status, keeping + a full audit trail. +- This README aggregates every **pending** finding (`Open` / `In Progress`) across all + modules. + +See **[REVIEW-PROCESS.md](REVIEW-PROCESS.md)** for the full procedure: the review +checklist, severity definitions, finding format, and how to mark items resolved. + +## Layout + +``` +code-reviews/ +├── README.md # this file — process overview + pending findings +├── REVIEW-PROCESS.md # how to perform a review and track findings +├── _template/findings.md # copy-this template for a module review +└── /findings.md # one folder per src/ project +``` + +## Baseline review — 2026-05-16 + +All 19 modules were reviewed at commit `9c60592`. This established the baseline below. + +| Severity | Open findings | +|----------|---------------| +| Critical | 6 | +| High | 46 | +| Medium | 100 | +| Low | 89 | +| **Total** | **241** | + +## Module Status + +| Module | Review status | Last reviewed | Commit | Open (C/H/M/L) | Total | +|--------|---------------|---------------|--------|----------------|-------| +| [CentralUI](CentralUI/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 1/3/10/5 | 19 | +| [CLI](CLI/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/1/6/6 | 13 | +| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/1/4/3 | 8 | +| [Commons](Commons/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/0/4/8 | 12 | +| [Communication](Communication/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 1/2/5/3 | 11 | +| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/1/4/6 | 11 | +| [DataConnectionLayer](DataConnectionLayer/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 1/4/6/2 | 13 | +| [DeploymentManager](DeploymentManager/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/3/6/5 | 14 | +| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 1/2/7/4 | 14 | +| [HealthMonitoring](HealthMonitoring/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/2/5/5 | 12 | +| [Host](Host/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/1/3/7 | 11 | +| [InboundAPI](InboundAPI/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/3/5/5 | 13 | +| [ManagementService](ManagementService/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/3/5/5 | 13 | +| [NotificationService](NotificationService/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 1/3/5/3 | 12 | +| [Security](Security/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/3/4/4 | 11 | +| [SiteEventLogging](SiteEventLogging/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/4/4/3 | 11 | +| [SiteRuntime](SiteRuntime/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/3/8/5 | 16 | +| [StoreAndForward](StoreAndForward/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 1/2/4/6 | 13 | +| [TemplateEngine](TemplateEngine/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/5/5/4 | 14 | + +## Pending Findings + +All findings are currently `Open`. As findings are resolved, remove them from the +tables below (see [REVIEW-PROCESS.md](REVIEW-PROCESS.md) §5). Full detail for each +finding — description, location, recommendation — lives in the module's `findings.md`. + +### Critical (6) + +| ID | Module | Title | +|----|--------|-------| +| CentralUI-001 | [CentralUI](CentralUI/findings.md) | Test Run sandbox executes arbitrary C# with no trust-model enforcement | +| Communication-001 | [Communication](Communication/findings.md) | Snapshot timeout leaves orphaned bridge actor and site subscription | +| DataConnectionLayer-001 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread | +| ExternalSystemGateway-001 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | No S&F delivery handler registered; cached calls and writes can never be delivered | +| NotificationService-001 | [NotificationService](NotificationService/findings.md) | Buffered notifications are never retried (no S&F delivery handler) | +| StoreAndForward-001 | [StoreAndForward](StoreAndForward/findings.md) | Replication to standby is never triggered by the active node | + +### High (46) + +| ID | Module | Title | +|----|--------|-------| +| CLI-001 | [CLI](CLI/findings.md) | `SCADALINK_FORMAT` env var and config-file format are dead; format precedence broken | +| CentralUI-002 | [CentralUI](CentralUI/findings.md) | Site-scoped Deployment permissions are issued but never enforced | +| CentralUI-003 | [CentralUI](CentralUI/findings.md) | `Console.SetOut`/`SetError` mutates process-global state across concurrent circuits | +| CentralUI-004 | [CentralUI](CentralUI/findings.md) | `CookieAuthenticationStateProvider` reads `HttpContext` for the life of the circuit | +| ClusterInfrastructure-001 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Module implements none of its documented responsibilities | +| Communication-002 | [Communication](Communication/findings.md) | gRPC reconnect does not unsubscribe the previous stream, leaking site-side relay actors | +| Communication-003 | [Communication](Communication/findings.md) | SiteStreamGrpcClient subscription map overwritten without disposal; reconnect can cancel the wrong stream | +| ConfigurationDatabase-001 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `GetTemplateWithChildrenAsync` loads child templates then discards them | +| DataConnectionLayer-002 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `Restart` supervision discards all subscription state on connection-actor crash | +| DataConnectionLayer-003 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `RealOpcUaClient` callback/monitored-item dictionaries mutated without synchronization | +| DataConnectionLayer-004 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Subscribe-time tag-resolution failure leaves the connection healthy but never recovers correctly | +| DataConnectionLayer-005 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `WriteTimeout` option is documented and configured but never applied | +| DeploymentManager-001 | [DeploymentManager](DeploymentManager/findings.md) | Unexpected exceptions leave the deployment record stuck in `InProgress` | +| DeploymentManager-002 | [DeploymentManager](DeploymentManager/findings.md) | Failure-status write uses a possibly-cancelled cancellation token | +| DeploymentManager-006 | [DeploymentManager](DeploymentManager/findings.md) | Query-the-site-before-redeploy idempotency requirement not implemented | +| ExternalSystemGateway-002 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | Per-system call timeout is never applied to HTTP requests | +| ExternalSystemGateway-003 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `CachedCall` double-dispatches the HTTP request | +| HealthMonitoring-001 | [HealthMonitoring](HealthMonitoring/findings.md) | Store-and-forward buffer depth metric is never populated | +| HealthMonitoring-002 | [HealthMonitoring](HealthMonitoring/findings.md) | `SiteHealthState` mutable fields written from multiple threads without synchronization | +| Host-001 | [Host](Host/findings.md) | `/health/ready` includes the leader-only `active-node` check | +| InboundAPI-001 | [InboundAPI](InboundAPI/findings.md) | Singleton script handler cache mutated without synchronization | +| InboundAPI-003 | [InboundAPI](InboundAPI/findings.md) | API key compared with non-constant-time string equality | +| InboundAPI-005 | [InboundAPI](InboundAPI/findings.md) | Compiled API scripts run with no script-trust-model enforcement | +| ManagementService-001 | [ManagementService](ManagementService/findings.md) | Remote-query and debug-snapshot handlers bypass site-scope enforcement | +| ManagementService-002 | [ManagementService](ManagementService/findings.md) | Single-entity query handlers leak data across site scope | +| ManagementService-003 | [ManagementService](ManagementService/findings.md) | DebugStreamHub.SubscribeInstance performs no per-instance authorization | +| NotificationService-002 | [NotificationService](NotificationService/findings.md) | `TimeoutException`/`OperationCanceledException` misclassified as transient | +| NotificationService-003 | [NotificationService](NotificationService/findings.md) | Error classification by substring matching on exception messages is fragile | +| NotificationService-004 | [NotificationService](NotificationService/findings.md) | `DeliverAsync` constructs two SMTP clients and leaks the used one | +| Security-001 | [Security](Security/findings.md) | StartTLS upgrade path is unreachable dead code | +| Security-002 | [Security](Security/findings.md) | Authentication cookie is not marked `Secure` | +| Security-003 | [Security](Security/findings.md) | JWT signing key length is never validated | +| SiteEventLogging-001 | [SiteEventLogging](SiteEventLogging/findings.md) | `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space | +| SiteEventLogging-002 | [SiteEventLogging](SiteEventLogging/findings.md) | Storage-cap purge deletes the entire table when space is not reclaimed | +| SiteEventLogging-003 | [SiteEventLogging](SiteEventLogging/findings.md) | Shared `SqliteConnection` used by purge and query without the write lock | +| SiteEventLogging-004 | [SiteEventLogging](SiteEventLogging/findings.md) | Event-log handler runs as a cluster singleton that can land on the standby node | +| SiteRuntime-001 | [SiteRuntime](SiteRuntime/findings.md) | `Instance.SetAttribute` never writes to the Data Connection Layer | +| SiteRuntime-002 | [SiteRuntime](SiteRuntime/findings.md) | `RouteInboundApiSetAttributes` always treats writes as static overrides | +| SiteRuntime-003 | [SiteRuntime](SiteRuntime/findings.md) | Redeployment relies on a fixed 500 ms reschedule and can collide on the child actor name | +| StoreAndForward-002 | [StoreAndForward](StoreAndForward/findings.md) | Messages enqueued with no registered handler are buffered but never deliverable | +| StoreAndForward-003 | [StoreAndForward](StoreAndForward/findings.md) | Off-by-one in retry accounting: immediate failure pre-counts as retry 1 | +| TemplateEngine-001 | [TemplateEngine](TemplateEngine/findings.md) | Deeply nested composed members are dropped during flattening | +| TemplateEngine-002 | [TemplateEngine](TemplateEngine/findings.md) | Derived templates omit all base alarms; composed alarms cannot be overridden per slot | +| TemplateEngine-003 | [TemplateEngine](TemplateEngine/findings.md) | `UpdateAttributeAsync` lets a non-locked attribute change its fixed DataType / DataSourceReference | +| TemplateEngine-004 | [TemplateEngine](TemplateEngine/findings.md) | Alarm on-trigger script references are never resolved (empty placeholder) | +| TemplateEngine-005 | [TemplateEngine](TemplateEngine/findings.md) | Collision validation is skipped when creating a child template | + +### Medium (100) + +| ID | Module | Title | +|----|--------|-------| +| CLI-002 | [CLI](CLI/findings.md) | Empty success body crashes table rendering with an unhandled exception | +| CLI-003 | [CLI](CLI/findings.md) | Non-JSON success body crashes table rendering | +| CLI-004 | [CLI](CLI/findings.md) | Malformed `--url` throws an unhandled `UriFormatException` | +| CLI-005 | [CLI](CLI/findings.md) | Malformed `--bindings` / `--overrides` JSON throws unhandled exceptions | +| CLI-006 | [CLI](CLI/findings.md) | Password is passed as a command-line argument with no safer alternative | +| CLI-007 | [CLI](CLI/findings.md) | `Component-CLI.md` command surface is substantially stale | +| CentralUI-005 | [CentralUI](CentralUI/findings.md) | Session expiry implementation diverges from the documented policy | +| CentralUI-006 | [CentralUI](CentralUI/findings.md) | Deployment status page polls every 10s despite the documented SignalR-push design | +| CentralUI-007 | [CentralUI](CentralUI/findings.md) | Monitoring nav links to Deployment-only pages are shown to all roles | +| CentralUI-008 | [CentralUI](CentralUI/findings.md) | Audit-log date filters treat browser-local datetimes as UTC | +| CentralUI-009 | [CentralUI](CentralUI/findings.md) | `DebugView` stream callbacks touch a possibly-disposed `ToastNotification` | +| CentralUI-010 | [CentralUI](CentralUI/findings.md) | `ToastNotification` auto-dismiss continuation runs after component disposal | +| CentralUI-011 | [CentralUI](CentralUI/findings.md) | `DiffDialog` leaves a dangling `TaskCompletionSource` when disposed while open | +| CentralUI-012 | [CentralUI](CentralUI/findings.md) | N+1 query loading data connections for the Sites page | +| CentralUI-013 | [CentralUI](CentralUI/findings.md) | `ScriptAnalysisService` blocks on async shared-script lookups | +| CentralUI-014 | [CentralUI](CentralUI/findings.md) | Test Run side effects (HTTP/SQL/SMTP) fire against production services | +| ClusterInfrastructure-002 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | No-op DI extension methods report success while doing nothing | +| ClusterInfrastructure-003 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | ClusterOptions omits several documented node-configuration settings | +| ClusterInfrastructure-004 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | ClusterOptions has no validation despite safety-critical values | +| ClusterInfrastructure-006 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | No tests for any cluster behaviour; only the options POCO is covered | +| Commons-001 | [Commons](Commons/findings.md) | `StaleTagMonitor` stale-fire race between timer and `OnValueReceived` | +| Commons-002 | [Commons](Commons/findings.md) | `DynamicJsonElement` retains a `JsonElement` whose `JsonDocument` lifetime it does not own | +| Commons-003 | [Commons](Commons/findings.md) | `ScriptParameters.GetNullable` silently swallows conversion failures | +| Commons-004 | [Commons](Commons/findings.md) | `ManagementCommandRegistry` name mapping is asymmetric and namespace-scoped | +| Communication-004 | [Communication](Communication/findings.md) | Coordinator actors declare no SupervisorStrategy (design requires Resume) | +| Communication-005 | [Communication](Communication/findings.md) | gRPC keepalive and max-stream-lifetime options are defined but never applied | +| Communication-006 | [Communication](Communication/findings.md) | Site address load failures are silently swallowed, leaving a stale cache | +| Communication-007 | [Communication](Communication/findings.md) | `SiteStreamGrpcClientFactory.Dispose` blocks on async work (sync-over-async) | +| Communication-008 | [Communication](Communication/findings.md) | Reconnect retry-count reset can mask a flapping stream indefinitely | +| ConfigurationDatabase-002 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Hardcoded `sa` connection string with embedded password literal | +| ConfigurationDatabase-003 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | No-arg `AddConfigurationDatabase()` silently registers nothing | +| ConfigurationDatabase-004 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Secret-bearing columns stored in plaintext with no protection | +| ConfigurationDatabase-007 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `AuditService` does not handle JSON-serialization failure of arbitrary `afterState` | +| DataConnectionLayer-006 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Health quality counters not reset/recomputed after failover or re-subscribe | +| DataConnectionLayer-007 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `ReadBatchAsync` aborts the whole batch on the first failing tag | +| DataConnectionLayer-009 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Implemented failover heuristic diverges from the documented state machine | +| DataConnectionLayer-010 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Tag-resolution retry can issue duplicate concurrent subscribe attempts | +| DataConnectionLayer-011 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Stale subscription callbacks from disposed adapters can still reach the actor | +| DataConnectionLayer-012 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `AutoAcceptUntrustedCerts` defaults to `true`, accepting any server certificate | +| DeploymentManager-003 | [DeploymentManager](DeploymentManager/findings.md) | Successful-deployment cleanup is not atomic with the status write | +| DeploymentManager-004 | [DeploymentManager](DeploymentManager/findings.md) | Site-success but central-delete-failure leaves orphaned site config | +| DeploymentManager-005 | [DeploymentManager](DeploymentManager/findings.md) | `OperationLockManager` leaks a `SemaphoreSlim` per instance name | +| DeploymentManager-007 | [DeploymentManager](DeploymentManager/findings.md) | "Diff View" reduced to a hash comparison with no diff detail | +| DeploymentManager-008 | [DeploymentManager](DeploymentManager/findings.md) | `DeploymentManagerOptions` is never bound to configuration | +| DeploymentManager-011 | [DeploymentManager](DeploymentManager/findings.md) | Tests never exercise a successful deployment or lifecycle success path | +| ExternalSystemGateway-004 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | System retry settings are not honoured for cached calls/writes | +| ExternalSystemGateway-005 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `HttpRequestMessage` and `HttpResponseMessage` are not disposed | +| ExternalSystemGateway-006 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `BuildUrl` ignores path templates and appends a trailing slash for empty paths | +| ExternalSystemGateway-007 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | External error response bodies are echoed verbatim into script-visible error messages | +| ExternalSystemGateway-008 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | Cancellation is conflated with transient timeout failure | +| ExternalSystemGateway-009 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `StoreAndForwardResult` from `EnqueueAsync` is discarded; permanent failures during buffering are swallowed | +| ExternalSystemGateway-010 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `GetConnectionAsync` leaks the `SqlConnection` when `OpenAsync` fails | +| HealthMonitoring-003 | [HealthMonitoring](HealthMonitoring/findings.md) | Shared state mutated inside `ConcurrentDictionary.AddOrUpdate` update delegate | +| HealthMonitoring-005 | [HealthMonitoring](HealthMonitoring/findings.md) | Central self-report site can flap offline; no heartbeat grace like real sites | +| HealthMonitoring-007 | [HealthMonitoring](HealthMonitoring/findings.md) | Heartbeats for not-yet-registered sites are silently dropped | +| HealthMonitoring-008 | [HealthMonitoring](HealthMonitoring/findings.md) | `GetAllSiteStates` / `GetSiteState` leak live mutable state objects to callers | +| HealthMonitoring-009 | [HealthMonitoring](HealthMonitoring/findings.md) | Missing test coverage for central report loop, heartbeat path, replication, and collector setters | +| Host-002 | [Host](Host/findings.md) | Akka.Persistence required by REQ-HOST-6 is not configured and not used | +| Host-003 | [Host](Host/findings.md) | Secrets committed in plaintext in `appsettings.Central.json` | +| Host-004 | [Host](Host/findings.md) | Site seed-node list points at the gRPC port, not a remoting port | +| InboundAPI-002 | [InboundAPI](InboundAPI/findings.md) | Lazy compilation is a check-then-act race with no atomicity | +| InboundAPI-004 | [InboundAPI](InboundAPI/findings.md) | Client disconnect is misreported as a script timeout | +| InboundAPI-006 | [InboundAPI](InboundAPI/findings.md) | No request body size limit on the inbound endpoint | +| InboundAPI-007 | [InboundAPI](InboundAPI/findings.md) | `Database.Connection()` script API from the design doc is not implemented | +| InboundAPI-008 | [InboundAPI](InboundAPI/findings.md) | Inbound API endpoint not restricted to the active central node | +| ManagementService-004 | [ManagementService](ManagementService/findings.md) | Actor offloads work to Task.Run instead of using PipeTo | +| ManagementService-006 | [ManagementService](ManagementService/findings.md) | JsonDocument instances never disposed in the HTTP endpoint | +| ManagementService-007 | [ManagementService](ManagementService/findings.md) | Inconsistent and cycle-prone serialization of repository entities | +| ManagementService-009 | [ManagementService](ManagementService/findings.md) | Audit logging applied inconsistently across mutating handlers | +| ManagementService-013 | [ManagementService](ManagementService/findings.md) | No tests for site-scope enforcement, the HTTP endpoint, or DebugStreamHub | +| NotificationService-005 | [NotificationService](NotificationService/findings.md) | Non-TLS path uses `SecureSocketOptions.Auto`, contradicting the requested mode | +| NotificationService-006 | [NotificationService](NotificationService/findings.md) | OAuth2 token cache is keyed to nothing; wrong token returned when multiple SMTP configs exist | +| NotificationService-007 | [NotificationService](NotificationService/findings.md) | Connection timeout and max-concurrent-connections from the design doc are not implemented | +| NotificationService-008 | [NotificationService](NotificationService/findings.md) | Recipient email addresses are not validated before send | +| NotificationService-009 | [NotificationService](NotificationService/findings.md) | Credentials handled as plaintext strings; OAuth2 client secret logged risk | +| Security-004 | [Security](Security/findings.md) | Search filter uses `uid=` while fallback DN construction uses `cn=` | +| Security-005 | [Security](Security/findings.md) | DN injection in the no-service-account bind fallback | +| Security-006 | [Security](Security/findings.md) | JWT validation disables issuer and audience checks | +| Security-007 | [Security](Security/findings.md) | Idle-timeout claim is reset on every token refresh | +| SiteEventLogging-005 | [SiteEventLogging](SiteEventLogging/findings.md) | `LogEventAsync` performs synchronous disk I/O on the caller's thread | +| SiteEventLogging-007 | [SiteEventLogging](SiteEventLogging/findings.md) | `ISiteEventLogger` consumers downcast to the concrete type and reach into the DB connection | +| SiteEventLogging-008 | [SiteEventLogging](SiteEventLogging/findings.md) | Event-recording write failures are silently swallowed | +| SiteEventLogging-010 | [SiteEventLogging](SiteEventLogging/findings.md) | Test coverage gaps: actor bridge, purge/write concurrency, vacuum effectiveness, query error path | +| SiteRuntime-004 | [SiteRuntime](SiteRuntime/findings.md) | `_totalDeployedCount` is incremented on redeployment of an existing instance | +| SiteRuntime-005 | [SiteRuntime](SiteRuntime/findings.md) | Deployment reports `Success` to central before persistence completes | +| SiteRuntime-006 | [SiteRuntime](SiteRuntime/findings.md) | Site-local repositories read `SiteStorageService` private field via reflection | +| SiteRuntime-007 | [SiteRuntime](SiteRuntime/findings.md) | Synthetic entity IDs use the non-deterministic `string.GetHashCode()` | +| SiteRuntime-008 | [SiteRuntime](SiteRuntime/findings.md) | Blocking `.GetAwaiter().GetResult()` on the actor thread during startup | +| SiteRuntime-009 | [SiteRuntime](SiteRuntime/findings.md) | Script execution actors run scripts on the default thread pool, not a dedicated dispatcher | +| SiteRuntime-010 | [SiteRuntime](SiteRuntime/findings.md) | `EnsureDclConnections` never updates a connection whose configuration changed | +| SiteRuntime-011 | [SiteRuntime](SiteRuntime/findings.md) | Trust-model validation is a substring scan and is both over- and under-inclusive | +| StoreAndForward-004 | [StoreAndForward](StoreAndForward/findings.md) | `RegisterDeliveryHandler` XML doc contradicts the implemented contract | +| StoreAndForward-005 | [StoreAndForward](StoreAndForward/findings.md) | Parked-message retry/discard can race with the in-progress retry sweep | +| StoreAndForward-010 | [StoreAndForward](StoreAndForward/findings.md) | Retry of a parked message does not reset `LastAttemptAt`, so its retry timing is unspecified | +| StoreAndForward-013 | [StoreAndForward](StoreAndForward/findings.md) | Critical paths lack test coverage: retry-due timing, replication-from-active, and the actor bridge | +| TemplateEngine-006 | [TemplateEngine](TemplateEngine/findings.md) | Forbidden-API enforcement is a naive substring scan (bypassable and false-positive prone) | +| TemplateEngine-007 | [TemplateEngine](TemplateEngine/findings.md) | Brace-balance "compilation" misjudges verbatim / interpolated / raw strings | +| TemplateEngine-008 | [TemplateEngine](TemplateEngine/findings.md) | `SetAlarmOverrideAsync` accepts overrides for unknown / composed alarms with no validation | +| TemplateEngine-009 | [TemplateEngine](TemplateEngine/findings.md) | N+1 query in `TemplateDeletionService.CanDeleteTemplateAsync` | +| TemplateEngine-010 | [TemplateEngine](TemplateEngine/findings.md) | `InstanceService` documents optimistic concurrency that is not implemented | + +### Low (89) + +| ID | Module | Title | +|----|--------|-------| +| CLI-008 | [CLI](CLI/findings.md) | `--format` value is not validated | +| CLI-009 | [CLI](CLI/findings.md) | Exit-code documentation does not match `HandleResponse` behaviour | +| CLI-010 | [CLI](CLI/findings.md) | `debug stream` reports Ctrl+C during connect as a connection failure | +| CLI-011 | [CLI](CLI/findings.md) | `CancellationTokenSource` in `debug stream` is never disposed | +| CLI-012 | [CLI](CLI/findings.md) | `debug stream` exit code is unreliable after stream termination | +| CLI-013 | [CLI](CLI/findings.md) | HTTP client, `debug stream`, and JSON-argument parsing are untested | +| CentralUI-015 | [CentralUI](CentralUI/findings.md) | `DialogService` continuations resolve off the render thread | +| CentralUI-016 | [CentralUI](CentralUI/findings.md) | Pagers render one button per page with no windowing | +| CentralUI-017 | [CentralUI](CentralUI/findings.md) | `/auth/logout` POST disables antiforgery, enabling logout CSRF | +| CentralUI-018 | [CentralUI](CentralUI/findings.md) | Broad `catch {}` blocks swallow JS interop and storage errors silently | +| CentralUI-019 | [CentralUI](CentralUI/findings.md) | Sparse unit-test coverage for a large module; critical paths untested | +| ClusterInfrastructure-005 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | No configuration section name constant for the Options pattern binding | +| ClusterInfrastructure-007 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | ClusterOptions lacks XML documentation comments | +| ClusterInfrastructure-008 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | "Phase 0 skeleton" status is undocumented at the module level | +| Commons-005 | [Commons](Commons/findings.md) | `OpcUaEndpointConfigSerializer.Deserialize` discards malformed legacy input and over-reports `IsLegacy` | +| Commons-006 | [Commons](Commons/findings.md) | `DynamicJsonElement.TryConvert` reports success for unconvertible target types | +| Commons-007 | [Commons](Commons/findings.md) | Several Commons types carry non-trivial logic, stretching REQ-COM-6 | +| Commons-008 | [Commons](Commons/findings.md) | `SetConnectionBindingsCommand` uses `ValueTuple` in a wire message contract | +| Commons-009 | [Commons](Commons/findings.md) | `Component-Commons.md` is stale relative to the actual file set | +| Commons-010 | [Commons](Commons/findings.md) | Behavior-bearing Commons types have no unit tests | +| Commons-011 | [Commons](Commons/findings.md) | `Result.Failure` accepts a null error string | +| Commons-012 | [Commons](Commons/findings.md) | `ValueFormatter` uses current-culture formatting without documenting it | +| Communication-009 | [Communication](Communication/findings.md) | `_siteClients` field is mutable and reassignable; cache update is not atomic on failure | +| Communication-010 | [Communication](Communication/findings.md) | `DebugStreamBridgeActor` XML doc incorrectly describes it as a "Persistent actor" | +| Communication-011 | [Communication](Communication/findings.md) | No test coverage for snapshot-timeout cleanup, address-cache failure, or gRPC reconnect leak | +| ConfigurationDatabase-005 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Audit `Id` type disagrees with the design doc | +| ConfigurationDatabase-006 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `Site.GrpcNodeAAddress` / `GrpcNodeBAddress` columns are unbounded | +| ConfigurationDatabase-008 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `GetApprovedKeysForMethodAsync` CSV parsing silently drops malformed ids | +| ConfigurationDatabase-009 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Multi-collection eager loads issue cartesian-product queries | +| ConfigurationDatabase-010 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Several repositories and `InstanceLocator` lack direct test coverage | +| ConfigurationDatabase-011 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Inconsistent constructor null-guarding across repositories/services | +| DataConnectionLayer-008 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleUnsubscribe` is O(n^2) over instances and rechecks `_unresolvedTags` redundantly | +| DataConnectionLayer-013 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Misleading XML comment: `RaiseDisconnected` claims thread safety it does not provide | +| DeploymentManager-009 | [DeploymentManager](DeploymentManager/findings.md) | Misleading timeout comment on `DeleteInstanceAsync` | +| DeploymentManager-010 | [DeploymentManager](DeploymentManager/findings.md) | `SystemArtifactDeploymentRecord` does not persist the deployment ID | +| DeploymentManager-012 | [DeploymentManager](DeploymentManager/findings.md) | `LifecycleCommandTimeout` option is dead code | +| DeploymentManager-013 | [DeploymentManager](DeploymentManager/findings.md) | SMTP credentials serialized and broadcast to all sites | +| DeploymentManager-014 | [DeploymentManager](DeploymentManager/findings.md) | Dead `CreateCommand` helper in artifact tests | +| ExternalSystemGateway-011 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | Every call performs a full repository scan of all systems and methods | +| ExternalSystemGateway-012 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | Permanent-failure logging requirement is not met; `_logger` is injected but unused | +| ExternalSystemGateway-013 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `MaxConcurrentConnectionsPerSystem` and `DefaultHttpTimeout` options are defined but never used | +| ExternalSystemGateway-014 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | Cached-call buffering path and `DatabaseGateway` are untested | +| HealthMonitoring-004 | [HealthMonitoring](HealthMonitoring/findings.md) | Inconsistent heartbeat interval described across XML docs | +| HealthMonitoring-006 | [HealthMonitoring](HealthMonitoring/findings.md) | Sequence seeding contradicts the doc's "starting at 1" wording and is untestable | +| HealthMonitoring-010 | [HealthMonitoring](HealthMonitoring/findings.md) | `HealthReportSender` silently swallows inner failures with bare `catch {}` | +| HealthMonitoring-011 | [HealthMonitoring](HealthMonitoring/findings.md) | `AddHealthMonitoringActors` is a dead no-op placeholder | +| HealthMonitoring-012 | [HealthMonitoring](HealthMonitoring/findings.md) | `SiteHealthState.LatestReport` initialized to `null!`, misrepresenting the contract | +| Host-005 | [Host](Host/findings.md) | Blocking sync-over-async (`GetAwaiter().GetResult()`) inside `StartAsync` | +| Host-006 | [Host](Host/findings.md) | HOCON assembled by unescaped string interpolation | +| Host-007 | [Host](Host/findings.md) | REQ-HOST-4 rule "GrpcPort ≠ RemotingPort" is not enforced | +| Host-008 | [Host](Host/findings.md) | `MachineDataDb` is validated and declared but never consumed | +| Host-009 | [Host](Host/findings.md) | `StartAsync` reports success before role actors are confirmed running | +| Host-010 | [Host](Host/findings.md) | No retry/backoff around startup preconditions (DB migration, readiness) | +| Host-011 | [Host](Host/findings.md) | `LoggingOptions.MinimumLevel` is dead configuration | +| InboundAPI-009 | [InboundAPI](InboundAPI/findings.md) | Failed compilation is retried on every subsequent request | +| InboundAPI-010 | [InboundAPI](InboundAPI/findings.md) | `ParameterValidator` ignores extra body fields and cannot validate Object/List element types | +| InboundAPI-011 | [InboundAPI](InboundAPI/findings.md) | Method-existence check leaks to unapproved callers (enumeration oracle) | +| InboundAPI-012 | [InboundAPI](InboundAPI/findings.md) | `ParameterDefinition` POCO declared in the component project, not Commons | +| InboundAPI-013 | [InboundAPI](InboundAPI/findings.md) | `ApiKeyValidationResult.NotFound` factory returns HTTP 400, contradicting its name | +| ManagementService-005 | [ManagementService](ManagementService/findings.md) | ManagementActor declares no supervision strategy | +| ManagementService-008 | [ManagementService](ManagementService/findings.md) | HandleResolveRoles constructs RoleMapper manually instead of via DI | +| ManagementService-010 | [ManagementService](ManagementService/findings.md) | ManagementServiceOptions.CommandTimeout is defined but never used | +| ManagementService-011 | [ManagementService](ManagementService/findings.md) | ResolveRolesCommand dispatch path is stale dead code | +| ManagementService-012 | [ManagementService](ManagementService/findings.md) | ManagementEnvelope carries a loosely-typed object payload | +| NotificationService-010 | [NotificationService](NotificationService/findings.md) | `DeliverAsync` does not disconnect the SMTP client on failure | +| NotificationService-011 | [NotificationService](NotificationService/findings.md) | `SmtpPermanentException` declared in the wrong file; module conventions | +| NotificationService-012 | [NotificationService](NotificationService/findings.md) | Test coverage gaps: OAuth2 delivery path, permanent-classification fallback, token-cache concurrency | +| Security-008 | [Security](Security/findings.md) | N+1 query loading site-scope rules in `RoleMapper` | +| Security-009 | [Security](Security/findings.md) | CancellationToken not honored inside `Task.Run` LDAP calls | +| Security-010 | [Security](Security/findings.md) | Design doc contradicts itself on Windows Integrated Authentication | +| Security-011 | [Security](Security/findings.md) | Missing tests for security-critical paths | +| SiteEventLogging-006 | [SiteEventLogging](SiteEventLogging/findings.md) | Missing indexes for severity and keyword-search query paths | +| SiteEventLogging-009 | [SiteEventLogging](SiteEventLogging/findings.md) | XML doc on `LogEventAsync` claims asynchronous behaviour | +| SiteEventLogging-011 | [SiteEventLogging](SiteEventLogging/findings.md) | Stale "Phase 4+" placeholder in `ServiceCollectionExtensions` | +| SiteRuntime-012 | [SiteRuntime](SiteRuntime/findings.md) | `AttributeAccessor`/`ScopeAccessors` block the script on a synchronous Ask | +| SiteRuntime-013 | [SiteRuntime](SiteRuntime/findings.md) | `HandleUnsubscribeDebugView` does nothing despite documented behaviour | +| SiteRuntime-014 | [SiteRuntime](SiteRuntime/findings.md) | Trigger-expression evaluation blocks the coordinator actor thread | +| SiteRuntime-015 | [SiteRuntime](SiteRuntime/findings.md) | `LoggerFactory` created per Instance Actor and never disposed | +| SiteRuntime-016 | [SiteRuntime](SiteRuntime/findings.md) | Short-lived execution actors, replication actor, and repositories are untested | +| StoreAndForward-006 | [StoreAndForward](StoreAndForward/findings.md) | `GetParkedMessagesAsync` count and page run without a transaction | +| StoreAndForward-007 | [StoreAndForward](StoreAndForward/findings.md) | Async work in `ParkedMessageHandlerActor` uses `ContinueWith` without scheduler/affinity guarantees | +| StoreAndForward-008 | [StoreAndForward](StoreAndForward/findings.md) | A SQLite connection is opened and torn down on every storage call | +| StoreAndForward-009 | [StoreAndForward](StoreAndForward/findings.md) | `OnActivity` event invocation is not thread-safe against concurrent subscribe/unsubscribe | +| StoreAndForward-011 | [StoreAndForward](StoreAndForward/findings.md) | `StoreAndForwardMessageStatus.InFlight` is unused and the doc's "retrying" status is unmodelled | +| StoreAndForward-012 | [StoreAndForward](StoreAndForward/findings.md) | `StoreAndForwardMessage` is a persistence entity but lives in the component, not Commons | +| TemplateEngine-011 | [TemplateEngine](TemplateEngine/findings.md) | `SortedPropertiesConverterFactory` is dead code with a misleading comment | +| TemplateEngine-012 | [TemplateEngine](TemplateEngine/findings.md) | `DataType` enum naming diverges from the design doc | +| TemplateEngine-013 | [TemplateEngine](TemplateEngine/findings.md) | `ToDictionary(t => t.Id)` throws on duplicate IDs; cycle detectors overload Id 0 as a sentinel | +| TemplateEngine-014 | [TemplateEngine](TemplateEngine/findings.md) | Template-deletion constraint logic is duplicated and divergent | diff --git a/code-reviews/REVIEW-PROCESS.md b/code-reviews/REVIEW-PROCESS.md new file mode 100644 index 0000000..e2ee1a2 --- /dev/null +++ b/code-reviews/REVIEW-PROCESS.md @@ -0,0 +1,109 @@ +# Code Review Process + +This document describes how to perform a comprehensive, per-module code review of +the ScadaLink codebase and how to track findings to resolution. + +A **module** is one buildable project under `src/` (e.g. `src/ScadaLink.TemplateEngine`). +Each module has its own folder under `code-reviews/` containing a single `findings.md`. + +## 1. Before you start + +1. Pick the module to review. Its folder is `code-reviews//` where `` + is the project name with the `ScadaLink.` prefix stripped. +2. Identify the design context for the module: + - Its component design doc: `docs/requirements/Component-.md`. + - The relevant **Key Design Decisions** in `CLAUDE.md`. + - `docs/requirements/HighLevelReqs.md` for cross-cutting requirements. +3. Record the exact commit being reviewed: `git rev-parse --short HEAD`. Every review + is a snapshot — a finding only means something relative to a known commit. +4. Open `code-reviews//findings.md` and fill in the header table + (reviewer, date, commit SHA). + +## 2. Review checklist + +Work through **every** category below for the module. A comprehensive review means +the checklist is completed even where it produces no findings — record "No issues +found" for a category rather than leaving it ambiguous. + +1. **Correctness & logic bugs** — off-by-one, null handling, incorrect conditionals, + misuse of APIs, broken edge cases. +2. **Akka.NET conventions** — supervision strategies (Resume for coordinators, Stop + for short-lived actors), `Tell` for hot paths / `Ask` only at system boundaries, + message immutability, no blocking on non-blocking dispatchers, no `sender`/`this` + captured in closures (`PipeTo` instead), correlation IDs on request/response. +3. **Concurrency & thread safety** — shared mutable state, actor state mutated only + on the actor thread, race conditions, correct use of async/await. +4. **Error handling & resilience** — exception paths, store-and-forward integration, + reconnect/retry logic, failover behaviour, transient vs permanent error + classification, graceful degradation. +5. **Security** — authentication/authorization checks, input validation, the script + trust model (forbidden APIs: `System.IO`, `Process`, `Threading`, `Reflection`, + raw network), secret handling, SQL/LDAP injection, logging of sensitive data. +6. **Performance & resource management** — `IDisposable` disposal, stream/connection + lifetimes, buffering and back-pressure, unnecessary allocations, N+1 queries. +7. **Design-document adherence** — does the code match `Component-.md` and the + relevant CLAUDE.md decisions? Flag both code that drifts from the design and design + docs that are now stale. +8. **Code organization & conventions** — persistence-ignorant POCO entities in + Commons, repository interfaces in Commons / implementations in ConfigurationDatabase, + namespace hierarchy, Options pattern (options classes owned by component projects), + additive-only message contract evolution. +9. **Testing coverage** — are the module's behaviours covered by tests in `tests/`? + Note untested critical paths and missing edge-case tests. +10. **Documentation & comments** — XML doc accuracy, misleading or stale comments, + undocumented non-obvious behaviour. + +## 3. Recording findings + +Add one entry per finding to the `## Findings` section of the module's `findings.md`, +using the entry format in [`_template/findings.md`](_template/findings.md). + +- **Finding ID** — `-NNN`, numbered sequentially within the module and never + reused (e.g. `TemplateEngine-001`). IDs are permanent even after resolution. +- **Severity:** + - **Critical** — data loss, security breach, crash/deadlock, or cluster-wide outage. + - **High** — incorrect behaviour with significant impact; no safe workaround. + - **Medium** — incorrect or risky behaviour with limited impact or a workaround. + - **Low** — minor issues, style, maintainability, documentation. +- **Category** — one of the 10 checklist categories above. +- **Location** — `file:line` (clickable), or a list of locations. +- **Description** — what is wrong and why it matters. +- **Recommendation** — concrete suggested fix. + +After recording findings, update the module header table (status, open-finding count) +and refresh the base README (step 5). + +## 4. Marking an item resolved + +Findings are **never deleted** — they are an audit trail. To close one, change its +**Status** and complete the **Resolution** field: + +- `Open` — newly recorded, not yet addressed. +- `In Progress` — a fix is actively being worked on. +- `Resolved` — fixed. The Resolution field must state the fixing commit SHA, the + date, and a one-line description of the fix. +- `Won't Fix` — intentionally not fixed. The Resolution field must justify why. +- `Deferred` — valid but postponed. The Resolution field must say what it is waiting + on (e.g. a tracked issue or a later milestone). + +`Resolved`, `Won't Fix`, and `Deferred` findings are all considered **closed** and +drop off the base README's pending list. `Open` and `In Progress` are **pending**. + +## 5. Updating the base README + +`code-reviews/README.md` holds the single cross-module view. After any review or +status change, update it: + +1. **Pending Findings table** — add/remove rows so it lists exactly the `Open` and + `In Progress` findings across all modules, sorted by severity. +2. **Module Status table** — update the row for the reviewed module (last-reviewed + date, commit, open-finding count, review status). + +The base README must always agree with the per-module `findings.md` files — they are +the source of truth; the README is the aggregated index. + +## 6. Re-reviewing a module + +Re-reviews append to the same `findings.md`. Update the header to the new commit and +date, continue the finding numbering from the last used ID, and leave prior findings +(including closed ones) in place as history. diff --git a/code-reviews/Security/findings.md b/code-reviews/Security/findings.md new file mode 100644 index 0000000..5ef117c --- /dev/null +++ b/code-reviews/Security/findings.md @@ -0,0 +1,365 @@ +# Code Review — Security + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.Security` | +| Design doc | `docs/requirements/Component-Security.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 11 | + +## Summary + +The Security module is small and reasonably structured: a stateless `LdapAuthService` +for search-then-bind authentication, a `JwtTokenService` for HMAC-signed cookie tokens, +a `RoleMapper` that resolves LDAP groups to roles, and ASP.NET Core authorization +policies plus a site-scope handler. Unit-test coverage of the happy paths is decent. +However, the review surfaced several real security weaknesses, the most serious being +that **StartTLS is dead code** (the design's "LDAPS or StartTLS" requirement is only +half met), that **the authentication cookie is not marked `Secure`** despite the design +mandating it, and that **the JWT signing key is never length-validated** so a weak or +empty key is silently accepted. There is also a genuine **DN-injection** gap in the +no-service-account fallback path, a filter/DN attribute mismatch (`uid=` vs `cn=`) that +makes that fallback path internally inconsistent, and an N+1 query in `RoleMapper`. +JWT validation also disables issuer/audience checks and the idle-timeout claim is reset +on every refresh, weakening the documented 30-minute idle policy. None of these are +crash/data-loss bugs, but the TLS, cookie, and key-validation items are security +defects that should be fixed before any production deployment. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | `uid=`/`cn=` attribute mismatch between search filter and fallback DN construction (Security-004); StartTLS branch is unreachable (Security-001). | +| 2 | Akka.NET conventions | ☑ | No actors in this module — `AddSecurityActors` is an empty placeholder. Nothing to assess. | +| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; LDAP sync calls wrapped in `Task.Run`. No shared mutable state. No issues found. | +| 4 | Error handling & resilience | ☑ | LDAP failure paths return structured `LdapAuthResult`; group-lookup failure is tolerated per design. `ct` not honored inside `Task.Run` bodies (Security-009). | +| 5 | Security | ☑ | StartTLS dead code (Security-001), cookie not `Secure` (Security-002), JWT key unvalidated (Security-003), DN injection (Security-005), no issuer/audience validation (Security-006), idle-timeout reset on refresh (Security-007). | +| 6 | Performance & resource management | ☑ | N+1 scope-rule query in `RoleMapper` (Security-008). `LdapConnection` correctly disposed via `using`. | +| 7 | Design-document adherence | ☑ | StartTLS unsupported and Secure cookie missing both contradict the design doc; design also says "Windows Integrated Authentication" in Responsibilities, contradicting its own Authentication section (Security-010). | +| 8 | Code organization & conventions | ☑ | `SecurityOptions` correctly owned by the component; repository interface in Commons. No issues found. | +| 9 | Testing coverage | ☑ | No tests for `RoleMapper` N+1 behavior, DN-injection inputs, StartTLS path, or idle-timeout-after-refresh. Insecure-config combinations under-tested (Security-011). | +| 10 | Documentation & comments | ☑ | `SecurityOptions` XML docs say direct bind uses `cn={username}` while the search filter uses `uid=` — comment is misleading (covered under Security-004). | + +## Findings + +### Security-001 — StartTLS upgrade path is unreachable dead code + +| | | +|--|--| +| Severity | High | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.Security/LdapAuthService.cs:37-47` | + +**Description** + +When `LdapUseTls` is true the code sets `connection.SecureSocketLayer = true` (LDAPS). +The subsequent StartTLS block is guarded by `if (_options.LdapUseTls && !connection.SecureSocketLayer)`. +Because `SecureSocketLayer` was just set to `true`, the second condition `!connection.SecureSocketLayer` +is always false, so `connection.StartTls()` is never called. The design doc explicitly +states LDAP connections must use **"LDAPS (port 636) or StartTLS"** — StartTLS is in +practice unsupported. A deployment that intends to use StartTLS on port 389 would get a +plaintext LDAPS-mode connection attempt that fails, or worse, an operator may disable +TLS entirely to make it work, sending credentials in cleartext. + +**Recommendation** + +Introduce an explicit transport mode (e.g. `LdapTransport { Ldaps, StartTls, None }`) +or a separate `LdapUseStartTls` flag. For StartTLS, leave `SecureSocketLayer` false, +call `connection.Connect`, then call `connection.StartTls()` and verify the negotiated +session is encrypted before binding. Remove the unreachable conditional. + +**Resolution** + +_Unresolved._ + +### Security-002 — Authentication cookie is not marked `Secure` + +| | | +|--|--| +| Severity | High | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.Security/ServiceCollectionExtensions.cs:16-23` | + +**Description** + +`AddCookie` sets `HttpOnly = true` and `SameSite = Strict` but never sets +`options.Cookie.SecurePolicy`. The ASP.NET Core default is `CookieSecurePolicy.SameAsRequest`, +which permits the cookie (carrying the embedded JWT — a bearer credential) to be sent +over plain HTTP. The design doc states the cookie is **"HttpOnly and Secure (requires +HTTPS)"**. As written, the module does not enforce that requirement; a misconfigured or +HTTP-fronted deployment would transmit the session token in cleartext. + +**Recommendation** + +Set `options.Cookie.SecurePolicy = CookieSecurePolicy.Always` in `AddCookie`. Consider +also setting `ExpireTimeSpan` and `SlidingExpiration` to align the cookie lifetime with +the documented 15-minute JWT / 30-minute idle policy. + +**Resolution** + +_Unresolved._ + +### Security-003 — JWT signing key length is never validated + +| | | +|--|--| +| Severity | High | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.Security/JwtTokenService.cs:33`, `src/ScadaLink.Security/SecurityOptions.cs:42` | + +**Description** + +`SecurityOptions.JwtSigningKey` defaults to `string.Empty` and is fed directly into +`new SymmetricSecurityKey(Encoding.UTF8.GetBytes(_options.JwtSigningKey))` with no +validation. HMAC-SHA256 requires a key of at least 256 bits (32 bytes); a short or empty +key produces a trivially forgeable token. The `SecurityHardeningTests` comment claims a +minimum length is "enforced", but no code in this module enforces it — the test only +asserts that a 32+ char key works. A deployment with a missing or short `JwtSigningKey` +would start successfully and issue weakly-signed tokens. + +**Recommendation** + +Validate `JwtSigningKey` at startup — fail fast if it is empty or shorter than 32 bytes. +Use an `IValidateOptions` validator or guard in the `JwtTokenService` +constructor so a weak key is rejected before any token is issued. + +**Resolution** + +_Unresolved._ + +### Security-004 — Search filter uses `uid=` while fallback DN construction uses `cn=` + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.Security/LdapAuthService.cs:66`, `:138`, `:157-159` | + +**Description** + +`AuthenticateAsync` and `ResolveUserDnAsync` build the search filter as +`(uid={username})`, but the no-service-account fallback in `ResolveUserDnAsync` +constructs the bind DN as `cn={username},{LdapSearchBase}`. The `SecurityOptions.LdapServiceAccountDn` +XML comment also documents the fallback as `cn={username},{LdapSearchBase}`. A directory +keyed on `uid` will succeed via search-then-bind but fail via the direct-bind fallback +(and vice versa). The attribute used for lookup is hard-coded and inconsistent across +the two code paths, so the two configuration modes are not interchangeable. + +**Recommendation** + +Introduce a single configurable `LdapUserIdAttribute` (default `uid`) and use it +consistently in both the search filter and the fallback DN. Update the XML doc to match. + +**Resolution** + +_Unresolved._ + +### Security-005 — DN injection in the no-service-account bind fallback + +| | | +|--|--| +| Severity | Medium | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.Security/LdapAuthService.cs:157-159` | + +**Description** + +When no service account is configured, the user-supplied `username` is interpolated +directly into a distinguished name: `$"cn={username},{LdapSearchBase}"`. `EscapeLdapFilter` +escapes *search-filter* metacharacters, but DN construction requires a different +escaping scheme (RFC 4514 — `,`, `+`, `"`, `\`, `<`, `>`, `;`, leading/trailing spaces). +No DN escaping is applied here. A username such as `victim,ou=admins` alters the DN +structure, allowing a caller to attempt a bind as a different DN than intended. Combined +with the `username.Contains('=')` shortcut at line 129 — which lets a caller supply a +full arbitrary DN — the fallback path gives the client undue control over the bind +identity. + +**Recommendation** + +Apply RFC 4514 DN-component escaping to `username` before interpolation, or use the +LDAP library's DN-builder API. Reconsider the `Contains('=')` shortcut — accepting a +raw DN from untrusted input is risky; restrict it or remove it. + +**Resolution** + +_Unresolved._ + +### Security-006 — JWT validation disables issuer and audience checks + +| | | +|--|--| +| Severity | Medium | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.Security/JwtTokenService.cs:67-75`, `:56-59` | + +**Description** + +`ValidateToken` sets `ValidateIssuer = false` and `ValidateAudience = false`, and +`GenerateToken` never sets an `iss` or `aud`. With a shared symmetric HMAC key, any +other system or component that signs JWTs with the same key would produce tokens this +service accepts. While the design states the key is shared only between the two central +nodes, omitting issuer/audience binding removes a cheap defense-in-depth control and +makes accidental key reuse (e.g. the same secret used for another internal token) +silently exploitable. + +**Recommendation** + +Set a fixed `Issuer` and `Audience` (e.g. `"scadalink-central"`) when generating tokens +and enable `ValidateIssuer`/`ValidateAudience` with the matching expected values during +validation. + +**Resolution** + +_Unresolved._ + +### Security-007 — Idle-timeout claim is reset on every token refresh + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.Security/JwtTokenService.cs:40`, `:111-123` | + +**Description** + +The design states the 30-minute idle timeout is tracked via a "last-activity timestamp +in the token", and `IsIdleTimedOut` reads the `LastActivity` claim. But `RefreshToken` +calls `GenerateToken`, which unconditionally writes `LastActivity = DateTimeOffset.UtcNow`. +Token refresh fires whenever a request arrives within ~5 minutes of expiry. The result +is that `LastActivity` reflects *token issuance time*, not genuine user activity — and +since refresh itself is a request, the timestamp keeps moving forward. A more subtle +consequence: the idle window is effectively measured from the last refresh, not the +last real interaction, so the documented "no requests within the idle window" semantics +are not faithfully implemented. The claim name `LastActivity` is also misleading. + +**Recommendation** + +Decide explicitly how activity is tracked. Either (a) carry the original `LastActivity` +forward across refreshes and update it only on real request handling in the middleware, +or (b) rename the claim to `IssuedAt`/`TokenCreated` and document that the idle window +is measured from issuance. Whichever is chosen, ensure `IsIdleTimedOut` and the refresh +path agree on the semantics. + +**Resolution** + +_Unresolved._ + +### Security-008 — N+1 query loading site-scope rules in `RoleMapper` + +| | | +|--|--| +| Severity | Low | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.Security/RoleMapper.cs:25-48` | + +**Description** + +`MapGroupsToRolesAsync` first calls `GetAllMappingsAsync`, then inside the per-mapping +loop calls `GetScopeRulesForMappingAsync(mapping.Id, ct)` once for every matched +Deployment mapping. This is an N+1 query pattern executed on the login hot path and on +every 15-minute token refresh. With multiple site-scoped Deployment groups it issues a +round-trip per group. + +**Recommendation** + +Add a repository method that loads scope rules for a set of mapping IDs in one query +(or eager-loads them with the mappings), and resolve all scope rules with a single call. + +**Resolution** + +_Unresolved._ + +### Security-009 — CancellationToken not honored inside `Task.Run` LDAP calls + +| | | +|--|--| +| Severity | Low | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.Security/LdapAuthService.cs:42`, `:46`, `:51`, `:56-57`, `:67-73`, `:135`, `:139-145` | + +**Description** + +The synchronous Novell LDAP calls are wrapped in `Task.Run(() => ..., ct)`. The `ct` +argument only prevents the work item from *starting* if cancellation is already +signaled; once a `connection.Connect`/`Bind`/`Search` call is in progress it cannot be +cancelled. A cancelled or timed-out login request will continue to occupy a thread-pool +thread and an LDAP connection until the blocking call returns on its own. There is also +no explicit network/operation timeout configured on the `LdapConnection`. + +**Recommendation** + +Configure `LdapConnection.ConnectionTimeout` and search/operation time limits so a +hung LDAP server cannot pin a thread indefinitely. Document that `ct` only guards +work-item scheduling, or implement a timeout-with-disconnect fallback. + +**Resolution** + +_Unresolved._ + +### Security-010 — Design doc contradicts itself on Windows Integrated Authentication + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Open | +| Location | `docs/requirements/Component-Security.md:13` (vs. `:23`) | + +**Description** + +The Responsibilities section states the component authenticates "using Windows +Integrated Authentication", but the Authentication section (line 23) and CLAUDE.md +explicitly state **"No Windows Integrated Authentication ... authenticates directly +against LDAP/AD, not via Kerberos/NTLM"** — which is what the code actually does +(direct LDAP bind). The Responsibilities line is stale and contradicts both the rest of +the doc and the implementation. + +**Recommendation** + +Fix `Component-Security.md:13` to say "using a direct LDAP/Active Directory bind" +to match the implemented behavior and the rest of the document. + +**Resolution** + +_Unresolved._ + +### Security-011 — Missing tests for security-critical paths + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.Security.Tests/UnitTest1.cs` | + +**Description** + +The test suite covers happy paths well but omits several security-relevant cases: +no test exercises the StartTLS path (Security-001), the DN-injection / `Contains('=')` +fallback inputs (Security-005), JWT validation with a too-short or empty signing key +(Security-003), `IsIdleTimedOut` returning true after a token has been refreshed +(Security-007), or the `uid`/`cn` mismatch in the no-service-account path (Security-004). +The integration `SecurityHardeningTests` only asserts default option values, not +enforcement. The test file is still named `UnitTest1.cs`. + +**Recommendation** + +Add negative/edge-case tests for the items above, particularly key-length rejection, +DN-escaping of hostile usernames, and idle-timeout behavior across a refresh. Rename +`UnitTest1.cs` to a descriptive name. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/SiteEventLogging/findings.md b/code-reviews/SiteEventLogging/findings.md new file mode 100644 index 0000000..c92fb07 --- /dev/null +++ b/code-reviews/SiteEventLogging/findings.md @@ -0,0 +1,402 @@ +# Code Review — SiteEventLogging + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.SiteEventLogging` | +| Design doc | `docs/requirements/Component-SiteEventLogging.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 11 | + +## Summary + +The SiteEventLogging module is small and broadly well-structured: a SQLite-backed +recorder (`SiteEventLogger`), a query service with keyset pagination, a background +purge service, and a thin Akka actor bridge. The query path is parameterised +correctly (no SQL injection) and reasonably well tested. However, the storage-cap +enforcement is functionally broken: `PRAGMA incremental_vacuum` is a no-op because +`auto_vacuum = INCREMENTAL` is never set, so the cap-purge loop never sees the +database shrink and over-deletes the entire table when triggered. There is also a +genuine concurrency hazard: the purge service and query service share the single +`SqliteConnection` owned by `SiteEventLogger` but bypass its `_writeLock`, so a purge +running on the background thread can collide with a write or a query on another +thread. The `LogEventAsync` API is synchronous despite its name and `Task` return, +which silently blocks Akka actor threads on disk I/O. Other findings concern the +cluster-singleton placement of the handler actor (which can pin to the standby +node), missing indexes for common query filters, retention/cap purge not enforcing +the requirement strictly, and several documentation/maintainability issues. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). | +| 2 | Akka.NET conventions | ☑ | Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). `Ask` boundary is appropriate. | +| 3 | Concurrency & thread safety | ☑ | Shared `SqliteConnection` used by purge/query without the write lock (-003). | +| 4 | Error handling & resilience | ☑ | `LogEventAsync` swallows write failures silently into a log line only (-008); purge catches broadly. | +| 5 | Security | ☑ | Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding. | +| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). | +| 7 | Design-document adherence | ☑ | Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002). | +| 8 | Code organization & conventions | ☑ | Concrete-type downcast of `ISiteEventLogger` (-007); `internal Connection` leaks DB handle (-007). | +| 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). | +| 10 | Documentation & comments | ☑ | `LogEventAsync` XML doc says "asynchronously" but is synchronous (-009); stale "Phase 4+" placeholder (-011). | + +## Findings + +### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space + +| | | +|--|--| +| Severity | High | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:100-102`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:36-55` | + +**Description** + +`PurgeByStorageCap` issues `PRAGMA incremental_vacuum` after each delete batch to +reclaim space, then re-measures the database size via `page_count * page_size`. +`incremental_vacuum` only has any effect when the database was created with +`auto_vacuum = INCREMENTAL`. `InitializeSchema` never sets `auto_vacuum`, so the +database uses the SQLite default (`auto_vacuum = NONE`). With `NONE`, +`incremental_vacuum` is silently ignored and `page_count` does not decrease when +rows are deleted (free pages are retained in the file). Consequently the +`while (currentSizeBytes > capBytes)` loop never observes the size dropping. The +storage-cap feature required by the design ("configurable maximum database size... +oldest events are purged first") is therefore non-functional — it cannot bring the +file back under the cap. + +**Recommendation** + +Set `PRAGMA auto_vacuum = INCREMENTAL` in `InitializeSchema` before any tables are +created (it must be set before table creation or followed by a full `VACUUM` to take +effect on an existing database). Alternatively, run a full `VACUUM` after cap-purge +deletes, or measure logical data size (e.g. `page_count - freelist_count` times +`page_size`) instead of relying on `incremental_vacuum`. + +**Resolution** + +_Unresolved._ + +### SiteEventLogging-002 — Storage-cap purge deletes the entire table when space is not reclaimed + +| | | +|--|--| +| Severity | High | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:87-105` | + +**Description** + +Because of SiteEventLogging-001 the on-disk size never shrinks after a delete batch, +so `currentSizeBytes` stays above `capBytes`. The loop then keeps deleting 1000-row +batches on every iteration until `ExecuteNonQuery` returns 0 — i.e. until the table +is completely empty. The design states the cap should purge "the oldest events... +first" to stay within budget, not wipe the whole log. When the cap is hit (e.g. +during an alarm storm) this destroys all retained diagnostic history rather than +trimming it to the budget. The unit test `PurgeByStorageCap_DeletesOldestWhenOverCap` +masks the problem because it uses `MaxStorageMb = 0`, which legitimately expects an +empty table, so the over-delete behaviour is never exercised against a realistic cap. + +**Recommendation** + +Fix the size measurement / vacuum (SiteEventLogging-001) so the loop terminates when +the file is genuinely under the cap. Add a guard so the loop stops once +`currentSizeBytes` has stopped decreasing across iterations, and add a test with a +non-zero cap and a known oversized dataset to assert that only the oldest events are +removed. + +**Resolution** + +_Unresolved._ + +### SiteEventLogging-003 — Shared `SqliteConnection` used by purge and query without the write lock + +| | | +|--|--| +| Severity | High | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:64,90,100,110,114`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:36`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34,72` | + +**Description** + +`SiteEventLogger` owns a single `SqliteConnection` and serialises its own writes via +`lock (_writeLock)`. `EventLogPurgeService` and `EventLogQueryService` both reach +into `_eventLogger.Connection` and execute commands directly, without acquiring +`_writeLock`. The purge runs on a `BackgroundService` thread (a different thread from +event-recording callers and from the actor that drives the query service). A single +`SqliteConnection` / `SqliteCommand` is not thread-safe; concurrent use from the +purge thread and a recording thread (or query thread) can throw +`SqliteException`/`InvalidOperationException` ("DataReader already open", +"connection busy") or corrupt command state. The purge `DELETE` and the recorder +`INSERT` racing is the most likely collision because event recording is continuous. + +**Recommendation** + +Funnel all access to the connection through a single synchronisation point: either +expose lock-guarded methods on `SiteEventLogger` for purge/query to call, or give the +purge and query services their own dedicated `SqliteConnection` instances (SQLite +supports multiple connections to the same file; `Cache=Shared` plus a `busy_timeout` +makes this safer). Do not share one `SqliteConnection` across threads. + +**Resolution** + +_Unresolved._ + +### SiteEventLogging-004 — Event-log handler runs as a cluster singleton that can land on the standby node + +| | | +|--|--| +| Severity | High | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:313-336`, `src/ScadaLink.SiteEventLogging/EventLogHandlerActor.cs:21-25` | + +**Description** + +`EventLogHandlerActor` is hosted as a `ClusterSingletonManager` singleton with the +stated intent that "queries always reach the active node". However, an Akka.NET +cluster singleton is pinned to the *oldest* member of the role, which is not the +same concept as the SCADA "active node" (the node currently running the Deployment +Manager singleton / serving live traffic). The design doc is explicit: "Only the +active node generates and stores events... the new active node starts logging to its +own SQLite database." The event-log SQLite file is node-local and unreplicated. +Nothing guarantees the event-log singleton co-locates with the active node, so a +remote query can be served by the standby node and read that node's near-empty +database, returning no events even though the active node has a full log. The +explanatory comment in `AkkaHostedService.cs` asserts the opposite of what actually +happens. + +**Recommendation** + +Either (a) host the query handler as a normal per-node actor and route queries to +the active node explicitly (the node owning the Deployment Manager singleton), or +(b) make the event-log writer follow the same singleton so the writer and the query +handler are guaranteed co-located. Reconcile the design doc and the inline comment +with whichever model is chosen. + +**Resolution** + +_Unresolved._ + +### SiteEventLogging-005 — `LogEventAsync` performs synchronous disk I/O on the caller's thread + +| | | +|--|--| +| Severity | Medium | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57-99` | + +**Description** + +`LogEventAsync` is declared `async`-shaped (returns `Task`, `Async` suffix) but its +body is entirely synchronous: it takes `lock (_writeLock)`, runs +`cmd.ExecuteNonQuery()` (a blocking SQLite write), then returns `Task.CompletedTask`. +Callers across the codebase invoke it fire-and-forget as `_ = LogEventAsync(...)` +(e.g. `ScriptExecutionActor.cs:133`, `DataConnectionActor.cs:292`, +`ScriptActor.cs:250`) expecting it to be non-blocking. In reality the SQLite write, +and any contention on `_writeLock`, executes inline on the Akka actor thread of the +calling subsystem. Under an event burst (alarm storm, script failure loop) this +serialises actor threads on disk I/O and the global write lock, degrading the +hot-path subsystems the design intends to keep responsive. + +**Recommendation** + +Either make recording genuinely asynchronous (offload to a dedicated single-threaded +writer / `Channel` consumer so callers truly fire-and-forget), or rename the +method to `LogEvent` and document that it blocks, so callers can decide. Given the +design's emphasis on not impacting runtime subsystems, an internal queue with a +background flush is preferable. + +**Resolution** + +_Unresolved._ + +### SiteEventLogging-006 — Missing indexes for severity and keyword-search query paths + +| | | +|--|--| +| Severity | Low | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:50-52`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:65-81` | + +**Description** + +`InitializeSchema` creates indexes on `timestamp`, `event_type`, and `instance_id`. +The query service also filters on `severity` (`severity = $severity`) and performs +`message LIKE '%...%'` / `source LIKE '%...%'` keyword search. `severity` has no +index, and a leading-wildcard `LIKE` cannot use a normal index at all. With up to a +1 GB database and a 500-row page size, severity-filtered and keyword queries do full +table scans on every page. The design explicitly lists keyword search as a supported, +expected query type. + +**Recommendation** + +Add an index on `severity` (or a composite index aligned with common filter +combinations such as `(event_type, severity, id)`). For keyword search, consider an +FTS5 virtual table over `message` and `source`, or accept the scan but document the +cost. + +**Resolution** + +_Unresolved._ + +### SiteEventLogging-007 — `ISiteEventLogger` consumers downcast to the concrete type and reach into the DB connection + +| | | +|--|--| +| Severity | Medium | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:25`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:26`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34` | + +**Description** + +Both `EventLogPurgeService` and `EventLogQueryService` take `ISiteEventLogger` via +DI and immediately downcast it: `_eventLogger = (SiteEventLogger)eventLogger;`. They +then access the `internal SqliteConnection Connection` property to run arbitrary SQL. +This defeats the purpose of the interface abstraction, makes the registration +fragile (any `ISiteEventLogger` that is not exactly `SiteEventLogger` causes an +`InvalidCastException` at construction), and leaks the database handle and raw SQL +surface out of the recorder. It is also the root cause of the unsynchronised +connection sharing in SiteEventLogging-003. + +**Recommendation** + +Introduce a proper data-access abstraction (e.g. an `IEventLogStore` with +`Insert`, `Query`, `PurgeOlderThan`, `PurgeToSize`, `GetSizeBytes`) that owns the +connection and its locking, and inject that into the recorder, query, and purge +services. Remove the `internal Connection` property and the concrete-type downcasts. + +**Resolution** + +_Unresolved._ + +### SiteEventLogging-008 — Event-recording write failures are silently swallowed + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:92-95` | + +**Description** + +If `ExecuteNonQuery` throws (disk full, database locked, file corruption), the +exception is caught, written to `ILogger`, and discarded; `LogEventAsync` still +returns `Task.CompletedTask` as if successful. Callers fire-and-forget the result so +they cannot detect failure. The event log is the site's diagnostic audit trail; a +sustained write failure (for example a locked-database storm caused by the +unsynchronised purge in SiteEventLogging-003) means events vanish with no signal to +operators except a local log line that nobody is watching. There is no failure +counter, no health-metric hook, and no retry. + +**Recommendation** + +Expose a failure signal: increment a counter that the Health Monitoring component +can surface (the design notes script/alarm error rates are derived from the event +log — a logging outage should be visible). At minimum, escalate repeated failures to +a Warning/Error health metric rather than only a local log line. + +**Resolution** + +_Unresolved._ + +### SiteEventLogging-009 — XML doc on `LogEventAsync` claims asynchronous behaviour + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:8-10`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57` | + +**Description** + +The interface XML doc states "Record an event asynchronously." and the method is +named `LogEventAsync`, but the implementation is fully synchronous (see +SiteEventLogging-005). The documentation and naming are misleading: a reader will +reasonably assume the write is offloaded and the caller's thread is not blocked, +which is false. The `details` parameter doc says "Optional JSON details" but nothing +validates or requires JSON, so callers may pass arbitrary text. + +**Recommendation** + +Align the name, signature, and documentation with the actual behaviour — either make +the method genuinely asynchronous or rename to `LogEvent` and correct the doc. +Clarify that `details` is free-form text unless JSON is actually enforced. + +**Resolution** + +_Unresolved._ + +### SiteEventLogging-010 — Test coverage gaps: actor bridge, purge/write concurrency, vacuum effectiveness, query error path + +| | | +|--|--| +| Severity | Medium | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.SiteEventLogging.Tests/` | + +**Description** + +The test suite covers recording, query filtering/pagination, and basic purge, but +several critical behaviours are untested: + +- `EventLogHandlerActor` has no test — the actor message contract + (`EventLogQueryRequest` -> `EventLogQueryResponse`, `Sender.Tell`) is unverified. +- No test exercises purge running concurrently with active writes/queries, so the + connection-sharing race (SiteEventLogging-003) is invisible to CI. +- `PurgeByStorageCap` is only tested with `MaxStorageMb = 0`, which hides the + no-op-vacuum / over-delete bug (SiteEventLogging-001, -002). No test asserts the + file shrinks or that only oldest events are removed under a realistic cap. +- `EventLogQueryService.ExecuteQuery`'s catch block (`Success: false`, + `ErrorMessage`) has no test. +- `SiteEventLogger.Dispose` semantics (logging after dispose returns + `Task.CompletedTask`) and re-entrant dispose are untested. + +**Recommendation** + +Add tests for the actor bridge, a concurrency stress test (purge + write + query in +parallel), a realistic non-zero-cap purge test asserting size reduction and +oldest-first deletion, and a query-error-path test (e.g. corrupt/closed connection). + +**Resolution** + +_Unresolved._ + +### SiteEventLogging-011 — Stale "Phase 4+" placeholder in `ServiceCollectionExtensions` + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:18-22` | + +**Description** + +`AddSiteEventLoggingActors` is an empty method with a comment "Placeholder for Akka +actor registration (Phase 4+)". The actor (`EventLogHandlerActor`) is in fact already +implemented and is registered directly in `AkkaHostedService.cs:313-336`, not through +this method. The placeholder is dead code: it is either never called or called with +no effect, and the comment is stale. A reader looking for where the event-log actor +is wired up will be misdirected. + +**Recommendation** + +Either implement the actor registration here and have `AkkaHostedService` call it +(centralising the wiring), or delete `AddSiteEventLoggingActors` entirely and remove +the misleading comment. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/SiteRuntime/findings.md b/code-reviews/SiteRuntime/findings.md new file mode 100644 index 0000000..41cf5de --- /dev/null +++ b/code-reviews/SiteRuntime/findings.md @@ -0,0 +1,564 @@ +# Code Review — SiteRuntime + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.SiteRuntime` | +| Design doc | `docs/requirements/Component-SiteRuntime.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 16 | + +## Summary + +The SiteRuntime module is broadly well-structured: the actor hierarchy matches the +design doc, supervision strategies are explicit, and the trigger/alarm evaluation +logic is thorough. However the review surfaced one genuinely serious correctness +defect — `Instance.SetAttribute` never routes writes to the Data Connection Layer +for data-sourced attributes, contradicting a core design decision and silently +turning device writes into local-only static overrides. Several other findings +cluster around two themes: (1) actor-thread discipline is violated in a few hot +paths (blocking `.GetAwaiter().GetResult()` calls on the actor thread, a fragile +fixed-delay reschedule for redeployment), and (2) the site-local repositories reach +into `SiteStorageService` private state via reflection and mint entity IDs with the +non-deterministic `string.GetHashCode()`. Script execution runs on the default +thread pool rather than a dedicated blocking dispatcher (the code acknowledges this +in a comment but ships it anyway). Test coverage exists for the coordinator actors, +persistence and scripting, but the short-lived execution actors, the replication +actor, and the repositories are untested. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ✓ | SetAttribute mis-routing, deploy double-count, redeploy reschedule race. | +| 2 | Akka.NET conventions | ✓ | Blocking on actor thread, script execution not on a dedicated dispatcher, premature success reply. | +| 3 | Concurrency & thread safety | ✓ | `_attributes` dictionary shared with child actors by reference; `_executionCounter` is actor-confined (OK). | +| 4 | Error handling & resilience | ✓ | Deploy reports Success before persistence; replicated artifact/S&F failures only logged (matches best-effort design). | +| 5 | Security | ✓ | Trust-model validation is substring-based and weak; reflection used to read private fields. | +| 6 | Performance & resource management | ✓ | Per-call SQLite connections (acceptable); CPU-bound scripts not interruptible by timeout. | +| 7 | Design-document adherence | ✓ | SetAttribute DCL routing missing; staggered-startup and supervision otherwise conform. | +| 8 | Code organization & conventions | ✓ | Repositories reflect into another class; synthetic IDs non-deterministic. | +| 9 | Testing coverage | ✓ | No tests for ScriptExecutionActor, AlarmExecutionActor, SiteReplicationActor, or the two repositories. | +| 10 | Documentation & comments | ✓ | Several XML comments describe behaviour the code does not implement (see findings). | + +## Findings + +### SiteRuntime-001 — `Instance.SetAttribute` never writes to the Data Connection Layer + +| | | +|--|--| +| Severity | High | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Scripts/ScriptRuntimeContext.cs:106`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:204` | + +**Description** + +The design doc (Component-SiteRuntime.md, "GetAttribute / SetAttribute" and +"Script Runtime API") states that `Instance.SetAttribute` on a *data-connected* +attribute must send a write request to the DCL, which writes to the physical +device, and that the in-memory value is **not** optimistically updated. For *static* +attributes it updates memory and persists an override. + +The implementation makes no such distinction. `ScriptRuntimeContext.SetAttribute` +unconditionally sends a `SetStaticAttributeCommand`, and `InstanceActor.HandleSetStaticAttribute` +unconditionally treats every write as a static override: it mutates `_attributes`, +publishes an `AttributeValueChanged` with hard-coded `"Good"` quality, notifies +children, and persists a SQLite override. A script writing a data-sourced attribute +therefore never reaches the device, the write failure can never be returned +synchronously to the script, and the in-memory value diverges from the device +until the next subscription update overwrites it. The persisted override is also +wrong: data-sourced attributes should not have static overrides. + +**Recommendation** + +In `InstanceActor`, look up the target attribute in `_configuration.Attributes`. If +it has a non-empty `DataSourceReference`, issue a DCL write (e.g. a `WriteTagRequest` +to `_dclManager`) and surface success/failure to the caller; do not persist an +override and do not optimistically mutate `_attributes`. Only attributes with no +data source reference should follow the current static-override path. Consider +splitting the message into `SetStaticAttributeCommand` vs `SetDataAttributeCommand`, +or branching inside the handler. + +**Resolution** + +_Unresolved._ + +### SiteRuntime-002 — `RouteInboundApiSetAttributes` always treats writes as static overrides + +| | | +|--|--| +| Severity | High | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:632` | + +**Description** + +`RouteInboundApiSetAttributes` (handling `Route.To().SetAttribute(s)` from the +Inbound API) emits a `SetStaticAttributeCommand` for every attribute, so it inherits +the same defect as SiteRuntime-001: writes to data-sourced attributes never reach +the device and are instead persisted as static overrides. In addition the response +is sent back as unconditionally successful (`true`) before the Instance Actor has +even processed the command, so a non-existent attribute or a future DCL write +failure is reported to the external caller as success. + +**Recommendation** + +Route through the same corrected `InstanceActor` write handler as SiteRuntime-001 so +the static-vs-data distinction is honoured. The optimistic ack is acceptable for +fire-and-forget static writes per the doc, but the XML comment should make the +limitation explicit, and once data-attribute writes are supported they need a real +response path. + +**Resolution** + +_Unresolved._ + +### SiteRuntime-003 — Redeployment relies on a fixed 500 ms reschedule and can collide on the child actor name + +| | | +|--|--| +| Severity | High | +| Category | Akka.NET conventions | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:222` | + +**Description** + +`HandleDeploy` stops an existing Instance Actor with `Context.Stop` and then +reschedules the same `DeployInstanceCommand` to itself after a hard-coded 500 ms, +hoping the child has fully terminated by then. `Context.Stop` is asynchronous; the +child is only removed from the parent's children collection after it actually stops +(including running `PostStop` on its descendants). If a deeply nested or slow +hierarchy takes longer than 500 ms, `CreateInstanceActor` calls `Context.ActorOf` +with a name that still belongs to the terminating child and throws +`InvalidActorNameException`. The `_instanceActors` dictionary check does not prevent +this — the dictionary entry is removed immediately, but the Akka child registry is +not. The 500 ms delay is also unconditionally added to every redeploy latency. + +**Recommendation** + +Watch the terminating child (`Context.Watch`) and recreate the Instance Actor only +after receiving the `Terminated` message, instead of guessing with a timer. Buffer +or stash the in-flight `DeployInstanceCommand` (and any further commands for that +instance) until termination completes. + +**Resolution** + +_Unresolved._ + +### SiteRuntime-004 — `_totalDeployedCount` is incremented on redeployment of an existing instance + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:239` | + +**Description** + +In `HandleDeploy`, the existing-actor branch (line 223) reschedules the command and +returns. When the rescheduled command runs, no actor exists, so the code falls +through to the "new instance" branch and executes `_totalDeployedCount++` +(line 239). A redeployment is an *update* of an already-deployed instance, not a new +one, so the deployed count is over-counted by one on every redeploy. +`StoreDeployedConfigAsync` uses UPSERT semantics, so the SQLite row count does not +grow, but the in-memory `_totalDeployedCount` (reported to the health collector via +`UpdateInstanceCounts`) drifts upward and the reported "disabled" count becomes +wrong. + +**Recommendation** + +Only increment `_totalDeployedCount` when the instance is genuinely new. Either +track whether this deploy replaced an existing config, or derive the deployed count +from storage / the union of running actors and disabled configs rather than +maintaining a hand-incremented counter. + +**Resolution** + +_Unresolved._ + +### SiteRuntime-005 — Deployment reports `Success` to central before persistence completes + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:272` | + +**Description** + +`HandleDeploy` replies to central with `DeploymentStatus.Success` immediately after +creating the Instance Actor, while the SQLite persistence (`StoreDeployedConfigAsync` ++ `ClearStaticOverridesAsync`) runs asynchronously on a `Task.Run`. If persistence +fails, `HandleDeployPersistenceResult` only logs an error — central has already been +told the deployment succeeded. On a subsequent node restart or failover the instance +will not be re-created (it is not in SQLite), so the deployment is silently lost +despite central recording success. This contradicts the design's intent that the +site is the durable source of truth for deployed configs. + +**Recommendation** + +Persist the config before replying, or treat a persistence failure as a deployment +failure and send a corrective `DeploymentStatusResponse`/health signal to central. +At minimum, do not report `Success` until the config row is committed. + +**Resolution** + +_Unresolved._ + +### SiteRuntime-006 — Site-local repositories read `SiteStorageService` private field via reflection + +| | | +|--|--| +| Severity | Medium | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs:183`, `src/ScadaLink.SiteRuntime/Repositories/SiteNotificationRepository.cs:181` | + +**Description** + +Both repositories' `CreateConnection()` use `Type.GetField("_connectionString", +BindingFlags.NonPublic | BindingFlags.Instance)` to extract the private connection +string out of `SiteStorageService`. This is brittle (any rename or refactor of the +field breaks it at runtime, not compile time), defeats encapsulation, and the +accompanying XML comment openly describes it as a "pragmatic" hack and is internally +contradictory (it states a connection string is "passed separately at DI +registration time" which is not what the code does). It also sits awkwardly against +the project's own script trust model, which forbids `System.Reflection` in scripts. + +**Recommendation** + +Expose the connection string properly: add an `ISiteStorageConnectionProvider` +(already referenced in `ServiceCollectionExtensions` XML docs but not used), or have +`SiteStorageService` expose a `CreateConnection()` factory, and inject that into the +repositories. Remove the reflection entirely. + +**Resolution** + +_Unresolved._ + +### SiteRuntime-007 — Synthetic entity IDs use the non-deterministic `string.GetHashCode()` + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs:241`, `src/ScadaLink.SiteRuntime/Repositories/SiteNotificationRepository.cs:254` | + +**Description** + +`GenerateSyntheticId` computes `name.GetHashCode() & 0x7FFFFFFF`. On .NET Core, +`string.GetHashCode()` is randomized per process by default, so the "stable +deterministic synthetic ID" promised by the XML comment is not stable at all — it +changes every time the process restarts. Any caller that obtained an ID and later +calls `GetExternalSystemByIdAsync`/`GetNotificationListByIdAsync` after a restart +will fail to find the entity. It also risks collisions: distinct names can hash to +the same 31-bit value, and `GetExternalSystemByIdAsync` would then return the wrong +row. + +**Recommendation** + +Use a deterministic, collision-resistant hash (e.g. a stable FNV-1a or the first +bytes of a SHA-256 of the name) if a synthetic integer ID is genuinely required, or +better, change the repository contract to key these site-local artifacts by name +rather than synthesising integer IDs. + +**Resolution** + +_Unresolved._ + +### SiteRuntime-008 — Blocking `.GetAwaiter().GetResult()` on the actor thread during startup + +| | | +|--|--| +| Severity | Medium | +| Category | Akka.NET conventions | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:479` | + +**Description** + +`LoadSharedScriptsFromStorage` is called synchronously from +`HandleStartupConfigsLoaded` (the actor's message handler) and performs +`_storage.GetAllSharedScriptsAsync().GetAwaiter().GetResult()` followed by Roslyn +compilation of every shared script. This blocks the DeploymentManager singleton's +mailbox thread for the full duration of the SQLite read and all shared-script +compilation. On the default dispatcher this also ties up a thread-pool thread and +risks thread-pool starvation, and the singleton cannot process any other message +(deployments, lifecycle commands, debug routing) until it returns. The rest of the +class correctly uses `PipeTo`/`ContinueWith`. + +**Recommendation** + +Load shared scripts asynchronously and `PipeTo(Self)` an internal message, the same +pattern already used for `StartupConfigsLoaded`. Perform compilation either inside +the piped continuation handler (still on the actor thread but at least off the +synchronous startup path) or on a dedicated background task whose result is piped +back. + +**Resolution** + +_Unresolved._ + +### SiteRuntime-009 — Script execution actors run scripts on the default thread pool, not a dedicated dispatcher + +| | | +|--|--| +| Severity | Medium | +| Category | Akka.NET conventions | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptExecutionActor.cs:72`, `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:289`, `src/ScadaLink.SiteRuntime/Actors/AlarmExecutionActor.cs:57` | + +**Description** + +The design (CLAUDE.md "Architecture & Runtime") states Script Execution Actors run +on a *dedicated blocking I/O dispatcher*. The code does not do this: `ScriptActor.SpawnExecution` +and `AlarmActor.SpawnAlarmExecution` create the execution actors with no +`.WithDispatcher(...)`, and the execution itself runs inside a bare `Task.Run`, +i.e. on the shared .NET thread pool. The `// NOTE: In production, configure a +dedicated ... dispatcher` comments acknowledge the gap but it ships unconfigured. +Scripts can perform synchronous blocking I/O (`Database.Connection`, synchronous +`ExternalSystem.Call`); running them on the shared pool can starve it and stall +unrelated Akka dispatchers and HTTP request handling under load. + +**Recommendation** + +Define the dedicated dispatcher in HOCON and chain `.WithDispatcher(...)` on the +execution actor `Props`. If the `Task.Run` model is kept, run script bodies on a +dedicated `TaskScheduler` / bounded scheduler rather than the global pool. Either +way, remove the "in production, configure…" comments by actually configuring it. + +**Resolution** + +_Unresolved._ + +### SiteRuntime-010 — `EnsureDclConnections` never updates a connection whose configuration changed + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:413` | + +**Description** + +`EnsureDclConnections` tracks created connections in `_createdConnections` and skips +any name already present (`if (_createdConnections.Contains(name)) continue;`). The +skip is purely name-based: if a redeployment (or an artifact deployment) changes the +endpoint, credentials, backup endpoint, or `FailoverRetryCount` of an existing +connection, the new configuration is silently ignored and the DCL keeps using the +stale `CreateConnectionCommand`. There is no `UpdateConnectionCommand` path. The +design states that after artifact deployment the site is fully self-contained with +current configuration; this caching breaks that for connection changes. + +**Recommendation** + +Compare the incoming connection config against the last one sent and re-issue a +create/update command when it differs, or have the DCL treat `CreateConnectionCommand` +as idempotent upsert and always forward it. Key the cache on a config hash, not just +the name. + +**Resolution** + +_Unresolved._ + +### SiteRuntime-011 — Trust-model validation is a substring scan and is both over- and under-inclusive + +| | | +|--|--| +| Severity | Medium | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Scripts/ScriptCompilationService.cs:52` | + +**Description** + +`ValidateTrustModel` enforces the script trust model by doing raw `string.Contains` / +`IndexOf` on the script source text for forbidden namespace strings. This is +unreliable in both directions: + +- **Bypass (under-inclusive):** the check looks only for the literal namespace + strings. A script can reach forbidden APIs without ever writing `System.IO` etc. — + e.g. via fully-qualified type use through aliasing, `global::`-prefixed names, or + simply because the namespace is already imported transitively. The compilation + references include `typeof(object).Assembly` (the whole of `System.Private.CoreLib`, + which contains `System.IO.File`, `System.Threading.Thread`, `System.Reflection`, + etc.), so forbidden types are fully resolvable at compile time and the only barrier + is this textual scan. +- **False positives (over-inclusive):** any occurrence of the substring in a comment, + string literal, or an unrelated identifier (e.g. a variable named `ProcessThreading`) + triggers a violation; the `AllowedExceptions` logic only rescues exact prefixes. +- The dead `isAllowed` variable at line 64 is computed and never used. + +**Recommendation** + +Enforce the trust model with a Roslyn `SyntaxWalker`/semantic analysis (inspect +resolved symbols and their containing namespaces/assemblies), or restrict the +compilation's metadata references and `AssemblyLoadContext` so forbidden types are +genuinely unavailable, rather than relying on source-text matching. Remove the +unused `isAllowed` variable. + +**Resolution** + +_Unresolved._ + +### SiteRuntime-012 — `AttributeAccessor`/`ScopeAccessors` block the script on a synchronous Ask + +| | | +|--|--| +| Severity | Low | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Scripts/ScopeAccessors.cs:28` | + +**Description** + +`AttributeAccessor`'s indexer getter calls `_ctx.GetAttribute(...).GetAwaiter().GetResult()`, +synchronously blocking the script-execution thread on an actor Ask. Combined with +SiteRuntime-009 (scripts run on the shared thread pool) this means a script that +reads several attributes via `Attributes["X"]` holds a pool thread blocked for each +round-trip. The async variants (`GetAsync`/`SetAsync`) exist but the ergonomic +indexer encourages the blocking path. The XML comment notes "Reads block on the +actor Ask" but does not warn about the thread-pool impact. + +**Recommendation** + +Once a dedicated script dispatcher exists (SiteRuntime-009) the blocking is contained +to that pool, which is acceptable; until then, document the cost clearly and prefer +steering script authors to the async accessors. Consider making the indexer +internal-only and exposing only the async API. + +**Resolution** + +_Unresolved._ + +### SiteRuntime-013 — `HandleUnsubscribeDebugView` does nothing despite documented behaviour + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:414` | + +**Description** + +`HandleUnsubscribeDebugView` is documented ("Debug view unsubscribe — removes +subscription") and the actor registers a handler for `UnsubscribeDebugViewRequest`, +but the body only logs a debug message — there is no subscription state in the +Instance Actor to remove. The design places the actual subscription lifecycle in +`SiteStreamManager` (`Subscribe`/`Unsubscribe`/`RemoveSubscriber`), so the Instance +Actor genuinely has nothing to do here. The handler and its XML comment are +therefore misleading: a reader expects it to tear down a subscription. + +**Recommendation** + +Either remove the no-op handler and route `UnsubscribeDebugViewRequest` to wherever +the `SiteStreamManager` subscription is actually cancelled, or correct the XML +comment to state explicitly that subscription teardown is handled by +`SiteStreamManager` and this handler is a no-op acknowledgement. + +**Resolution** + +_Unresolved._ + +### SiteRuntime-014 — Trigger-expression evaluation blocks the coordinator actor thread + +| | | +|--|--| +| Severity | Low | +| Category | Akka.NET conventions | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:219`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:389` | + +**Description** + +`EvaluateExpressionTrigger` (ScriptActor) and `EvaluateExpression` (AlarmActor) run a +compiled Roslyn script with `.RunAsync(...).GetAwaiter().GetResult()` directly inside +the actor's `AttributeValueChanged` message handler. This blocks the coordinator +actor's mailbox thread for up to the 2-second timeout on every monitored attribute +change. Coordinator actors are on the default dispatcher and process the hot path of +attribute-change fan-out; a slow expression delays all other messages to that actor +and consumes a thread-pool thread for the duration. The inline comments correctly +note CPU-bound expressions are not interruptible but do not address the +mailbox-blocking concern. + +**Recommendation** + +Trigger expressions are expected to be cheap, but to keep the actor responsive +consider evaluating them off the actor thread (pipe the boolean result back as an +internal message) or pre-compiling to a plain delegate that executes near-instantly +without the Roslyn scripting `RunAsync` machinery. + +**Resolution** + +_Unresolved._ + +### SiteRuntime-015 — `LoggerFactory` created per Instance Actor and never disposed + +| | | +|--|--| +| Severity | Low | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:746` | + +**Description** + +`CreateInstanceActor` does `var loggerFactory = new LoggerFactory();` for every +Instance Actor it creates, uses it once to produce an `ILogger`, and +never disposes it. `LoggerFactory` is `IDisposable`. With up to 500 instances (and +churn from redeployments) this leaks a factory per instance, and the produced +loggers are detached from the application's configured logging providers, so +Instance Actor logs may not be routed/filtered consistently with the rest of the +host. + +**Recommendation** + +Inject the application's `ILoggerFactory` (or an `ILogger` factory +delegate) into `DeploymentManagerActor` via DI and reuse it, rather than newing one +up per child. Do not create a fresh `LoggerFactory` in a hot creation path. + +**Resolution** + +_Unresolved._ + +### SiteRuntime-016 — Short-lived execution actors, replication actor, and repositories are untested + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.SiteRuntime.Tests/` | + +**Description** + +The test project covers the coordinator actors (`InstanceActor`, `ScriptActor`, +`AlarmActor`, `DeploymentManagerActor`), persistence, scripting and streaming, but a +search of the test sources finds no references to `ScriptExecutionActor`, +`AlarmExecutionActor`, `SiteReplicationActor`, `SiteExternalSystemRepository`, or +`SiteNotificationRepository`. These cover critical paths: script timeout/failure +handling and result reply, alarm on-trigger execution, peer config/S&F replication +(including the `SendToPeer` no-peer drop), and the reflection-based repository reads. +Several findings above (001/002 mis-routing, 007 ID instability, 011 trust bypass) +would likely have been caught by targeted tests. + +**Recommendation** + +Add unit/integration tests for the execution actors (success, timeout, exception, +Ask-reply, PoisonPill self-stop), `SiteReplicationActor` (outbound forward, inbound +apply, peer tracking on cluster events), and the two repositories (round-trip read, +synthetic-ID lookup, missing-row behaviour). + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/StoreAndForward/findings.md b/code-reviews/StoreAndForward/findings.md new file mode 100644 index 0000000..a56e4d7 --- /dev/null +++ b/code-reviews/StoreAndForward/findings.md @@ -0,0 +1,465 @@ +# Code Review — StoreAndForward + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.StoreAndForward` | +| Design doc | `docs/requirements/Component-StoreAndForward.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 13 | + +## Summary + +The Store-and-Forward module is small and readable, with a clean SQLite persistence +layer, a sensible service API, and reasonable test coverage of the storage and service +happy paths. However the review surfaced two issues that undermine the module's core +purpose. First, the active delivery path never invokes the `ReplicationService` — +`ReplicateEnqueue/Remove/Park` have no callers anywhere in the codebase, so buffered +messages are not replicated to the standby node and the design's failover-durability +guarantee (Component doc "Persistence", CLAUDE.md "Store-and-Forward") is not met. +Second, there is an off-by-one in retry accounting: the immediate-failure path stores a +buffered message with `RetryCount = 1`, so a message configured with `MaxRetries = N` +is actually attempted `N` times in total rather than one immediate attempt plus `N` +retries, and a per-source `MaxRetries` of 1 produces zero retry attempts. Additional +themes: SQLite connection-per-call with no transactional grouping of multi-statement +operations, no concurrency guard against a parked message being retried while the +sweep is mid-flight, an unused enum member (`InFlight`) that drifts from the documented +status set, and untested critical paths (retry-due timing, replication-from-active, +the actor bridge). None of the findings are blockers for compilation, but the +replication and retry-count issues are functional defects against the design. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010). | +| 2 | Akka.NET conventions | ☑ | `ContinueWith` used instead of `PipeTo`-friendly continuations; default supervision; see 007. | +| 3 | Concurrency & thread safety | ☑ | Sweep guarded by `Interlocked`, but no guard against retry-vs-manage races (005); `OnActivity` event not thread-safe (009). | +| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002). | +| 5 | Security | ☑ | No issues found — parameterised SQL throughout; no secrets handled directly; payload JSON treated opaquely. | +| 6 | Performance & resource management | ☑ | New SQLite connection per call; multi-statement operations not wrapped in a transaction (006, 008). | +| 7 | Design-document adherence | ☑ | Replication gap (001); `InFlight` status undocumented/unused (011); "retrying" status from design doc not modelled. | +| 8 | Code organization & conventions | ☑ | `StoreAndForwardMessage` is an entity-like POCO living in the component, not Commons (012). | +| 9 | Testing coverage | ☑ | Retry-due timing, replication-from-active, and `ParkedMessageHandlerActor` are untested (013). | +| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004). | + +## Findings + +### StoreAndForward-001 — Replication to standby is never triggered by the active node + +| | | +|--|--| +| Severity | Critical | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.StoreAndForward/ReplicationService.cs:40`, `:53`, `:66`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:155`, `:212`, `:222`, `:236` | + +**Description** + +`ReplicationService` exposes `ReplicateEnqueue`, `ReplicateRemove` and `ReplicatePark` +to forward buffer operations to the standby node, but a codebase-wide search shows these +methods have no callers. `StoreAndForwardService` — which performs every add (`EnqueueAsync` +line 155 / 163), remove (`RemoveMessageAsync` call at line 212) and park +(`UpdateMessageAsync` calls at lines 222/236) — holds no reference to `ReplicationService` +and never invokes it. Only the receiving half is wired (`SetReplicationHandler` and +`ApplyReplicatedOperationAsync` are used by `SiteReplicationActor`). The Component design +doc ("Persistence") and CLAUDE.md ("Store-and-Forward") require the active node to +forward every buffer operation to the standby so that, on failover, the new active node +"has a near-complete copy of the buffer." As written, the standby's S&F SQLite database +stays empty and a failover loses the entire buffer — a data-loss defect against a core +requirement. + +**Recommendation** + +Inject `ReplicationService` into `StoreAndForwardService` and call `ReplicateEnqueue` +after a successful `_storage.EnqueueAsync`, `ReplicateRemove` after `RemoveMessageAsync`, +and `ReplicatePark` after a park-causing `UpdateMessageAsync`. Update +`ServiceCollectionExtensions.AddStoreAndForward` to pass the dependency. Add a test that +asserts the replication handler observes each operation type. + +**Resolution** + +_Unresolved._ + +### StoreAndForward-002 — Messages enqueued with no registered handler are buffered but never deliverable + +| | | +|--|--| +| Severity | High | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:162`, `:201` | + +**Description** + +`EnqueueAsync` falls through to "No handler registered — buffer for later" (line 162) +when no delivery handler is registered for the category. The retry sweep +(`RetryMessageAsync`, line 201) then logs "No delivery handler for category" and +`return`s without touching the message. No caller in the codebase ever calls +`RegisterDeliveryHandler` (the External System Gateway, Notification Service and +Database Gateway only call `EnqueueAsync`), so in the current wiring **every** buffered +message lands in this dead state: it is persisted, counts toward buffer depth, but can +never be retried, delivered or parked. It will sit Pending forever. Either the handler +registration is missing from Host/gateway startup, or the "buffer for later" path is a +silent trap. Either way the engine cannot deliver anything. + +**Recommendation** + +Decide the intended contract. If handlers are expected to be registered before +`EnqueueAsync` is reachable, make `EnqueueAsync` reject (or log an error) when no +handler exists rather than silently buffering an undeliverable message, and wire +`RegisterDeliveryHandler` calls in Host startup for all three categories. If late +registration is intended, the retry sweep should treat a still-missing handler as a +transient condition with bounded logging rather than a permanent no-op. + +**Resolution** + +_Unresolved._ + +### StoreAndForward-003 — Off-by-one in retry accounting: immediate failure pre-counts as retry 1 + +| | | +|--|--| +| Severity | High | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:153`, `:229`, `:233` | + +**Description** + +On a transient immediate-delivery failure, `EnqueueAsync` buffers the message with +`message.RetryCount = 1` (line 153). The retry sweep then increments `RetryCount` before +the max check (`RetryCount++` at line 229; `RetryCount >= MaxRetries` at line 233). +Consequences: (1) a message configured with `MaxRetries = 1` is parked on the *first* +retry sweep without ever being retried, because after the immediate attempt `RetryCount` +is already 1 and the first sweep makes it 2 ≥ 1 — zero actual retries occur, contradicting +the design intent that the immediate attempt and the retry budget are distinct; +(2) the design doc's `Retry Count` field is "Number of attempts so far," but here it is +seeded to 1 before any *retry* has happened, making the parked-message `AttemptCount` +shown to operators off by one relative to configured `MaxRetries`. The +`EnqueueAsync_TransientFailure_BuffersForRetry` test even asserts `RetryCount == 1`, +locking in the ambiguity. + +**Recommendation** + +Choose one consistent meaning for `RetryCount` (recommended: total delivery attempts, +including the immediate one) and apply it uniformly. If `MaxRetries` is meant to bound +*retries* after the immediate attempt, buffer with `RetryCount = 0` and treat the +immediate failure as attempt 0; if it bounds *total attempts*, document that and adjust +the comparison. Update the affected test to match the chosen semantics. + +**Resolution** + +_Unresolved._ + +### StoreAndForward-004 — `RegisterDeliveryHandler` XML doc contradicts the implemented contract + +| | | +|--|--| +| Severity | Medium | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:38`, `:60` | + +**Description** + +The XML comment on the handler delegate (lines 37–40) says "Returns true on success, +throws on transient failure. Permanent failures should return false (message will NOT +be buffered)." That last clause is wrong for the retry path: in `RetryMessageAsync`, +a handler returning `false` does not "not buffer" — the message is already buffered, and +the code *parks* it immediately (lines 218–224). The comment describes only the +`EnqueueAsync` immediate path and misleads anyone implementing a handler about what +`false` means once a message is in the retry loop. + +**Recommendation** + +Reword the contract to cover both paths explicitly: `true` = delivered (remove from +buffer); `false` = permanent failure (not buffered on immediate attempt, parked on a +retry); exception = transient failure (buffer / increment retry). + +**Resolution** + +_Unresolved._ + +### StoreAndForward-005 — Parked-message retry/discard can race with the in-progress retry sweep + +| | | +|--|--| +| Severity | Medium | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:184`, `:266`, `:280` | + +**Description** + +`RetryPendingMessagesAsync` loads a snapshot of due messages (line 179) and then +processes them one by one (line 184), `await`-ing delivery for each. Meanwhile +`RetryParkedMessageAsync` / `DiscardParkedMessageAsync` (operator actions arriving via +`ParkedMessageHandlerActor`) run on unrelated threads and mutate the same rows. Because +each operation opens its own SQLite connection and there is no row-level coordination, +an operator can `DiscardParkedMessageAsync` a message that the sweep is concurrently +delivering: the sweep's later `RemoveMessageAsync`/`UpdateMessageAsync` operates on a +now-deleted row (harmless) — but if an operator `RetryParkedMessageAsync` resets a row +to Pending while the sweep simultaneously parks the same in-flight message, the operator +intent is silently overwritten. The `Interlocked` guard only prevents *overlapping +sweeps*, not sweep-vs-management races. + +**Recommendation** + +Funnel all message-state mutations through a single serialization point — e.g. perform +all S&F state changes inside the `ParkedMessageHandlerActor` (or a dedicated S&F actor) +so the actor mailbox serialises them, or make status transitions conditional in SQL +(e.g. `UPDATE ... WHERE id = @id AND status = @expected`) and re-check the affected +row count. + +**Resolution** + +_Unresolved._ + +### StoreAndForward-006 — `GetParkedMessagesAsync` count and page run without a transaction + +| | | +|--|--| +| Severity | Low | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:166`, `:175` | + +**Description** + +`GetParkedMessagesAsync` issues a `COUNT(*)` and then a separate paged `SELECT` on two +commands on the same connection with no surrounding transaction. A concurrent +enqueue/park/discard between the two statements yields a `TotalCount` inconsistent with +the returned page (e.g. total reported as 51 while only 50 distinct parked rows now +exist, or a row visible in the page but excluded from the count). For a paginated UI +this produces flickering totals and occasional off-by-one page math. + +**Recommendation** + +Wrap both reads in a single transaction (`BeginTransaction`) so they see a consistent +snapshot, or accept the staleness and document it. A transaction is cheap here and +removes the inconsistency. + +**Resolution** + +_Unresolved._ + +### StoreAndForward-007 — Async work in `ParkedMessageHandlerActor` uses `ContinueWith` without scheduler/affinity guarantees + +| | | +|--|--| +| Severity | Low | +| Category | Akka.NET conventions | +| Status | Open | +| Location | `src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs:34`, `:68`, `:87` | + +**Description** + +The three handlers call a `Task`-returning service method and chain `.ContinueWith(...) +.PipeTo(sender)`. `Sender` is correctly captured into a local first, so the closure is +safe. However `ContinueWith` without an explicit `TaskScheduler` runs the continuation +on a thread-pool thread and the captured continuation builds the response objects there +— acceptable since it only touches locals, but it bypasses the idiomatic +`PipeTo`-with-success/failure-projection pattern and is fragile if someone later adds a +line touching actor state inside the continuation. There is also no `TaskContinuationOptions`, +so a faulted antecedent still runs the continuation (handled here via `IsCompletedSuccessfully`, +but only by convention). + +**Recommendation** + +Replace `ContinueWith(...).PipeTo(sender)` with `PipeTo(sender, success: result => ..., +failure: ex => ...)`, which is the documented Akka pattern, keeps response construction +off the actor thread safely, and makes the success/failure branches explicit. + +**Resolution** + +_Unresolved._ + +### StoreAndForward-008 — A SQLite connection is opened and torn down on every storage call + +| | | +|--|--| +| Severity | Low | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:28`, `:61`, `:93`, `:117`, `:144`, `:162`, `:199`, `:221`, `:237`, `:267`, `:285`, `:305`, `:319` | + +**Description** + +Every method in `StoreAndForwardStorage` constructs a fresh `SqliteConnection` and calls +`OpenAsync`. Microsoft.Data.Sqlite pools connections, so this is not a correctness bug, +but a retry sweep over a large buffer performs one open per `UpdateMessageAsync`/ +`RemoveMessageAsync` call inside the loop (`RetryMessageAsync`), multiplying connection +churn under load. With no max buffer size (by design) the buffer can grow large, so the +per-message connection acquisition is a measurable overhead on the hot retry path. + +**Recommendation** + +Consider a batched retry API that opens one connection (and one transaction) per sweep, +or pass an open connection into the per-message update calls. At minimum, document that +the design relies on the Sqlite connection pool for acceptable performance. + +**Resolution** + +_Unresolved._ + +### StoreAndForward-009 — `OnActivity` event invocation is not thread-safe against concurrent subscribe/unsubscribe + +| | | +|--|--| +| Severity | Low | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:46`, `:309` | + +**Description** + +`OnActivity` is a public `event Action<...>` raised via `OnActivity?.Invoke(...)` in +`RaiseActivity` (line 309). `RaiseActivity` is called from both `EnqueueAsync` (caller +thread) and `RetryMessageAsync` (timer thread). The `?.Invoke` null-conditional captures +the delegate once so it will not NRE, but there is no synchronisation around the event +field itself; a subscriber added/removed concurrently with a raise has no defined +ordering. More importantly, subscriber callbacks run synchronously on the timer thread, +so a slow or throwing subscriber stalls or aborts the retry sweep (an exception in a +subscriber propagates out of `RaiseActivity` into `RetryMessageAsync`'s `try` and is +swallowed as a "transient failure," wrongly incrementing the message's retry count). + +**Recommendation** + +Snapshot the delegate (already done) and additionally wrap subscriber invocation in a +`try/catch` so a faulting logging subscriber cannot be misclassified as a delivery +failure. Document that handlers must be fast and non-throwing, or dispatch activity +notifications asynchronously. + +**Resolution** + +_Unresolved._ + +### StoreAndForward-010 — Retry of a parked message does not reset `LastAttemptAt`, so its retry timing is unspecified + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:203`, `:101` | + +**Description** + +`RetryParkedMessageAsync` sets `status = Pending, retry_count = 0, last_error = NULL` +but leaves `last_attempt_at` unchanged (line 203–206). The retry-due query +(`GetMessagesForRetryAsync`, line 101–105) selects Pending rows where +`last_attempt_at IS NULL OR ... elapsed >= retry_interval_ms`. A message parked after +exhausting retries has an old `last_attempt_at`; once re-queued, the elapsed time since +that stale timestamp is almost certainly already greater than the retry interval, so the +operator-retried message is attempted on the very next sweep regardless of the +configured interval. That is probably the desired behaviour (operator wants it tried +now), but it is unspecified and inconsistent — if `retry_interval_ms` were very large the +behaviour would instead be "try immediately" by accident rather than by design. + +**Recommendation** + +Explicitly decide and encode the intent: either set `last_attempt_at = NULL` on +re-queue so the message is unambiguously due now, or set it to "now" so it waits one +interval. Document the chosen behaviour in the method's XML comment. + +**Resolution** + +_Unresolved._ + +### StoreAndForward-011 — `StoreAndForwardMessageStatus.InFlight` is unused and the doc's "retrying" status is unmodelled + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.Commons/Types/Enums/StoreAndForwardMessageStatus.cs:9`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:219`, `:235` | + +**Description** + +The enum defines `Pending, InFlight, Parked, Delivered`. The module only ever uses +`Pending` and `Parked` — `InFlight` and `Delivered` are never assigned (delivered +messages are deleted, not marked `Delivered`). Meanwhile the Component design doc +("Message Format" -> Status) specifies the set "Pending, retrying, or parked." So the +code's enum drifts from the doc in two directions: it carries dead members the doc does +not mention (`InFlight`, `Delivered`) and omits the doc's `retrying` state. A message +mid-retry is indistinguishable from one that has never been attempted. + +**Recommendation** + +Reconcile the enum with the design. Either drop the unused members and update the doc, +or implement the documented `retrying` state and use `InFlight` to mark a message the +sweep is actively delivering (which would also help with finding 005). + +**Resolution** + +_Unresolved._ + +### StoreAndForward-012 — `StoreAndForwardMessage` is a persistence entity but lives in the component, not Commons + +| | | +|--|--| +| Severity | Low | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardMessage.cs:9` | + +**Description** + +`StoreAndForwardMessage` is a persistence-ignorant POCO that maps directly to the +`sf_messages` table and is also carried across the network inside `ReplicationOperation` +(replicated to the standby node over Akka remoting). CLAUDE.md "Code Organization" states +that entity classes are persistence-ignorant POCOs in Commons and that message contracts +follow additive-only evolution. Because this type doubles as a replication wire contract +but lives in the component assembly, it is not co-located with the other Commons +entities and its evolution is not governed by the additive-only message-contract rule. +This is a borderline case (the type is site-local), but the cross-node use via +`ReplicationOperation` makes it a de-facto message contract. + +**Recommendation** + +Either move `StoreAndForwardMessage` (and `ReplicationOperation`) into the Commons +`Entities`/`Messages` hierarchy so they are governed by the contract-evolution rules, or +introduce a separate DTO for replication and keep `StoreAndForwardMessage` purely as the +local persistence model. Document the decision. + +**Resolution** + +_Unresolved._ + +### StoreAndForward-013 — Critical paths lack test coverage: retry-due timing, replication-from-active, and the actor bridge + +| | | +|--|--| +| Severity | Medium | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.StoreAndForward.Tests/` (whole directory); `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:101`; `src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs` | + +**Description** + +The existing tests cover storage CRUD and the service happy/failure paths well, but +three important behaviours are untested: (1) the retry-due time filter in +`GetMessagesForRetryAsync` — every service test sets `DefaultRetryInterval = TimeSpan.Zero`, +so the `julianday` elapsed-time comparison (the most error-prone SQL in the module) is +never exercised with a non-zero interval; a message that is *not yet due* should be +skipped, and that is never verified. (2) Replication from the active side — no test +asserts that an enqueue/remove/park causes a `Replicate*` call (this is exactly the gap +behind finding 001; a test would have caught it). (3) `ParkedMessageHandlerActor` has no +test at all — the Query/Retry/Discard request-to-response mapping and the +`ExtractMethodName` JSON parsing are unverified, including the malformed-JSON branch. + +**Recommendation** + +Add tests for: a non-zero retry interval where a recently-attempted message is excluded +and an older one is included; active-side replication invocation per operation type +(once finding 001 is fixed); and `ParkedMessageHandlerActor` using `Akka.TestKit`, +including `ExtractMethodName` for `MethodName`, `Subject`, missing-property and +invalid-JSON payloads. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/TemplateEngine/findings.md b/code-reviews/TemplateEngine/findings.md new file mode 100644 index 0000000..212674a --- /dev/null +++ b/code-reviews/TemplateEngine/findings.md @@ -0,0 +1,487 @@ +# Code Review — TemplateEngine + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.TemplateEngine` | +| Design doc | `docs/requirements/Component-TemplateEngine.md` | +| Status | Reviewed | +| Last reviewed | 2026-05-16 | +| Reviewer | claude-agent | +| Commit reviewed | `9c60592` | +| Open findings | 14 | + +## Summary + +The Template Engine is a pure central-side modeling library: stateless services +over `ITemplateEngineRepository` plus four static helper classes (collision, cycle, +lock, resolver). It has no Akka actors and no direct concurrency, so the Akka and +thread-safety categories produce nothing of substance. The code is generally +well-structured and the cascade-based composition model (derived templates owned by +composition slots) is consistently applied. However the review surfaced several real +correctness gaps. The most serious are in **flattening**: composed alarms and scripts +nested below the first level are silently dropped, derived templates omit base +alarms entirely (breaking per-slot alarm override), and the alarm-on-trigger-script +resolution step is an empty placeholder so that whole validation rule is dead. +Validation has two security-relevant weaknesses — the forbidden-API scan is a naive +substring match and the brace-balance "compile" check mispredicts on verbatim / +interpolated / raw string literals. Several documented behaviours (collision check on +create, optimistic concurrency on instance state) are claimed but not implemented. +Themes: validation that is weaker than the design promises, and asymmetric handling +of attributes vs. alarms vs. scripts throughout the resolve/flatten/derive paths. + +## Checklist coverage + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ✓ | Multiple real bugs: deep composed-member loss, derived alarms omitted, granularity bypass, no-op create-time collision block. | +| 2 | Akka.NET conventions | ✓ | No actors in this module (`AddTemplateEngineActors` is an empty placeholder). Nothing to assess. | +| 3 | Concurrency & thread safety | ✓ | Services are stateless, scoped per request; static helpers hold no mutable state. Design says template editing is last-write-wins; that is honoured. See TemplateEngine-010 re: a doc claim of optimistic concurrency that is not implemented. | +| 4 | Error handling & resilience | ✓ | `Result` used consistently; repository nulls guarded. `FlatteningService` wraps in try/catch. No store-and-forward or failover surface in this module. | +| 5 | Security | ✓ | No auth checks in-module (delegated to callers per design). Script trust-model enforcement is weak — see TemplateEngine-006 and TemplateEngine-007. | +| 6 | Performance & resource management | ✓ | `GetAllTemplatesAsync` reloaded on most member edits; one genuine N+1 in `TemplateDeletionService` (TemplateEngine-009). No `IDisposable` leaks (`JsonDocument`/streams disposed). | +| 7 | Design-document adherence | ✓ | Drift found: recursive composition not fully implemented in flattening; `DataType` enum naming differs from doc; optimistic-concurrency claim. | +| 8 | Code organization & conventions | ✓ | POCO entities in Commons, repo interfaces in Commons, Options pattern N/A (no options here). Duplicate deletion logic (TemplateEngine-014). | +| 9 | Testing coverage | ✓ | Tests exist for every file, but the dead/placeholder paths (TemplateEngine-004, 005) and deep nesting (TemplateEngine-001) are not exercised. | +| 10 | Documentation & comments | ✓ | Mostly accurate; a misleading converter comment (TemplateEngine-011) and a stale enum/doc mismatch (TemplateEngine-012). | + +## Findings + +### TemplateEngine-001 — Deeply nested composed members are dropped during flattening + +| | | +|--|--| +| Severity | High | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.TemplateEngine/Flattening/FlatteningService.cs:211`, `src/ScadaLink.TemplateEngine/Flattening/FlatteningService.cs:535`, `src/ScadaLink.TemplateEngine/Flattening/FlatteningService.cs:609` | + +**Description** + +The design doc states composition supports "recursive nesting of feature modules" +and that nested paths extend as `[Outer].[Inner].[Member]`. `ResolveComposedAttributes` +only descends **one** level of nesting: it resolves the directly-composed module, then +its immediate child compositions, and stops. A module composed three or more levels +deep contributes no attributes to the flattened configuration. `ResolveComposedAlarms` +and `ResolveComposedScripts` are worse — they handle only the first (direct) level and +do not descend at all, so any alarm or script in a nested composed module is dropped +entirely. `CollisionDetector` and `TemplateResolver` recurse fully, so collision +detection and the authoring UI will show members that the deployed configuration +silently lacks. + +**Recommendation** + +Replace the hand-unrolled one/two-level loops with a single recursive walk +(carrying the accumulated path prefix) for attributes, alarms, and scripts, matching +the recursion already in `TemplateResolver.AddComposedMembers` and +`CollisionDetector.CollectComposedMembers`. + +**Resolution** + +_Unresolved._ + +### TemplateEngine-002 — Derived templates omit all base alarms; composed alarms cannot be overridden per slot + +| | | +|--|--| +| Severity | High | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:799` | + +**Description** + +`BuildDerivedTemplate` copies the base template's `Attributes` and `Scripts` into the +new derived template as `IsInherited = true` placeholder rows so they can be overridden +per composition slot, but there is **no loop for `Alarms`**. The derived template +therefore has zero alarm rows. The `TemplateAlarm` entity also has no `IsInherited` or +`LockedInDerived` fields (unlike `TemplateAttribute` / `TemplateScript`), so even if a +copy loop were added there is no mechanism to mark a copied alarm as inherited or to +override one. The design's Override Granularity section explicitly requires composed +alarm fields (Priority, Trigger thresholds, Description, On-Trigger Script) to be +overridable. As written, a composed module's alarms cannot be tuned for the slot they +are used in. + +**Recommendation** + +Add an alarm copy loop to `BuildDerivedTemplate` and add `IsInherited` / +`LockedInDerived` fields to `TemplateAlarm`, mirroring `TemplateAttribute`. Update +`UpdateAlarmAsync` to honour them as `UpdateAttributeAsync` / `UpdateScriptAsync` +already do. + +**Resolution** + +_Unresolved._ + +### TemplateEngine-003 — `UpdateAttributeAsync` lets a non-locked attribute change its fixed DataType / DataSourceReference + +| | | +|--|--| +| Severity | High | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:285` | + +**Description** + +`LockEnforcer.ValidateAttributeOverride` correctly rejects a change to `DataType` or +`DataSourceReference` (both "fixed by the defining level" per the design). But the +caller only honours that error when the attribute is already locked: + +```csharp +var granularityError = LockEnforcer.ValidateAttributeOverride(existing, proposed); +if (granularityError != null && existing.IsLocked) + return Result.Failure(granularityError); +``` + +Lines 293-294 then unconditionally apply `existing.DataType = proposed.DataType` and +`existing.DataSourceReference = proposed.DataSourceReference`. For the common case of an +unlocked attribute, the fixed-field guard is dead and both fields are silently mutable, +violating the override-granularity rule. (The lock-error branch of the same helper is +also redundant — a locked attribute already returns earlier inside the helper.) + +**Recommendation** + +Remove the `&& existing.IsLocked` condition so the granularity error is always +returned, and stop assigning `DataType` / `DataSourceReference` from `proposed` in the +apply block. + +**Resolution** + +_Unresolved._ + +### TemplateEngine-004 — Alarm on-trigger script references are never resolved (empty placeholder) + +| | | +|--|--| +| Severity | High | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.TemplateEngine/Flattening/FlatteningService.cs:695` | + +**Description** + +`ResolveAlarmScriptReferences` is invoked as Step 7 of `Flatten` but its body is empty +— only a comment describing what it should do. Consequently every +`ResolvedAlarm.OnTriggerScriptCanonicalName` stays `null`. This has two downstream +effects: (1) `SemanticValidator`'s "on-trigger script must exist" check +(`SemanticValidator.cs:209`) can never fire, so the design-mandated validation of +alarm on-trigger script references is silently absent; (2) `RevisionHashService` and +`DiffService` both hash/compare `OnTriggerScriptCanonicalName`, so a change to which +script an alarm triggers never affects the revision hash and is invisible to the diff +— a real staleness-detection gap. + +**Recommendation** + +Implement the resolution: map each alarm's `OnTriggerScriptId` (set on `TemplateAlarm`) +to the canonical name of the corresponding resolved script, accounting for composition +prefixes. If the design intends scripts to be referenced by name within scope, document +and implement that consistently. + +**Resolution** + +_Unresolved._ + +### TemplateEngine-005 — Collision validation is skipped when creating a child template + +| | | +|--|--| +| Severity | High | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:56` | + +**Description** + +`CreateTemplateAsync` contains a block guarded by `if (parentTemplateId.HasValue)` that +loads `GetAllTemplatesAsync` and then does nothing but hold a comment — it never runs a +collision check. A child template created with a parent inherits the parent's members; +if the child is later given members (via `AddAttributeAsync` etc.) those calls do run +`CollisionDetector`, but the create path itself performs no naming-collision validation +and `UpdateTemplateAsync` only validates collisions on a name change. The design states +naming collisions are design-time errors that must block a save. The dead block is also +confusing and allocates an unused full-table read. + +**Recommendation** + +Either run a real collision check on the to-be-created template (including its +inherited members) or delete the dead block and its unused query. If create-time +collisions are genuinely impossible because a fresh template has no members, document +that explicitly instead of leaving a no-op. + +**Resolution** + +_Unresolved._ + +### TemplateEngine-006 — Forbidden-API enforcement is a naive substring scan (bypassable and false-positive prone) + +| | | +|--|--| +| Severity | Medium | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.TemplateEngine/Validation/ScriptCompiler.cs:21`, `src/ScadaLink.TemplateEngine/Validation/ValidationService.cs:318` | + +**Description** + +`ScriptCompiler.ForbiddenPatterns` is checked with `code.Contains(pattern)`. This is +both under- and over-inclusive against the script trust model: +- **Bypass:** `using System.IO;` followed by `File.ReadAllText(...)` contains no + `System.IO.` token; `using static System.IO.File;`, namespace aliases, and + `global::System.IO.File` all evade the literal patterns. +- **False positive:** a string literal, comment, or attribute name containing the text + `System.IO.` is flagged as a forbidden API even though it is inert. + +The same patterns are reused for trigger-expression validation +(`CheckExpressionSyntax`), inheriting the same weakness. The file comment acknowledges +this is interim until Roslyn is wired in, but the trust model is security-relevant and +the gap should be tracked. + +**Recommendation** + +Defer real enforcement to the Roslyn-based compiler (semantic symbol analysis of +referenced types/namespaces) rather than text matching. Until then, document the +limitation prominently and treat the substring scan as advisory, not authoritative. + +**Resolution** + +_Unresolved._ + +### TemplateEngine-007 — Brace-balance "compilation" misjudges verbatim / interpolated / raw strings + +| | | +|--|--| +| Severity | Medium | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.TemplateEngine/Validation/ScriptCompiler.cs:54`, `src/ScadaLink.TemplateEngine/SharedScriptService.cs:124` | + +**Description** + +`ScriptCompiler.TryCompile` tracks string state with a single `inString` flag toggled +on `"` and an escaped-quote check of `code[i-1] != '\\'`. It does not understand +verbatim strings (`@"..."` where `""` is the escape and `\` is literal), interpolated +strings (`$"{...}"` whose braces are code, not text), raw string literals (`"""..."""`), +or char literals. A script with a verbatim string containing a brace, an interpolated +string, or a `'}'` char literal will be wrongly rejected as having mismatched braces — +blocking a valid script from deployment. `SharedScriptService.ValidateSyntax` is even +cruder: it counts braces/brackets/parens with no string or comment awareness at all, so +any string literal containing one of those characters produces a false syntax error. + +**Recommendation** + +Once the Roslyn compiler is available, parse with `CSharpSyntaxTree.ParseText` and +inspect diagnostics instead of hand-rolling a tokenizer. If an interim check must +remain, at minimum handle verbatim/interpolated/char literals or scope the check down +to something that cannot false-positive. + +**Resolution** + +_Unresolved._ + +### TemplateEngine-008 — `SetAlarmOverrideAsync` accepts overrides for unknown / composed alarms with no validation + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.TemplateEngine/Services/InstanceService.cs:178` | + +**Description** + +`SetAlarmOverrideAsync` looks up the alarm by name among the template's **direct** +alarms only. When the lookup returns `null` — which is the case for every composed +(path-qualified) alarm as well as for a genuinely non-existent name — the method skips +the lock check and proceeds to persist the override. This means: (1) an override can be +created for an alarm that does not exist (a silent dead record), and (2) a composed +alarm that is `IsLocked` at the template level can be overridden, bypassing the lock +rule. `SetAttributeOverrideAsync` by contrast rejects unknown attribute names. The +inline comment acknowledges the gap but the behaviour is inconsistent and risky. + +**Recommendation** + +Resolve the full effective alarm set (via the resolver / flattening) so composed +alarms are found, reject overrides whose canonical name is not in that set, and apply +the lock check to composed alarms too. + +**Resolution** + +_Unresolved._ + +### TemplateEngine-009 — N+1 query in `TemplateDeletionService.CanDeleteTemplateAsync` + +| | | +|--|--| +| Severity | Medium | +| Category | Performance & resource management | +| Status | Open | +| Location | `src/ScadaLink.TemplateEngine/Services/TemplateDeletionService.cs:75` | + +**Description** + +Check 3 ("other templates compose it directly") loads all templates and then issues a +separate `GetCompositionsByTemplateIdAsync` call **inside a loop over every template** +— one round-trip per template in the database. The composition information needed is +already reachable via `t.Compositions` on the templates returned by +`GetAllTemplatesAsync` (which `TemplateService.DeleteTemplateAsync` uses for the +equivalent check at line 162). The loop scales linearly with the template count on +every delete-precheck and every actual delete. + +**Recommendation** + +Use the `Compositions` navigation already loaded by `GetAllTemplatesAsync`, or add a +single repository call that returns all compositions, rather than querying per +template. + +**Resolution** + +_Unresolved._ + +### TemplateEngine-010 — `InstanceService` documents optimistic concurrency that is not implemented + +| | | +|--|--| +| Severity | Medium | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.TemplateEngine/Services/InstanceService.cs:9` | + +**Description** + +The class summary states instances support "Enabled/disabled state with optimistic +concurrency". `EnableAsync`, `DisableAsync`, `AssignToAreaAsync` and the override/binding +mutators all perform a plain read-modify-write with no version token, `RowVersion`, or +concurrency check. Two concurrent enable/disable requests last-writer-wins with no +detection. Either the doc is stale (the design's optimistic-concurrency decision +applies to *deployment status records*, not instance state) or a concurrency token was +intended and is missing. + +**Recommendation** + +If last-write-wins is acceptable for instance state, correct the XML doc. If optimistic +concurrency is required, add a concurrency token to `Instance` and surface a conflict +result. + +**Resolution** + +_Unresolved._ + +### TemplateEngine-011 — `SortedPropertiesConverterFactory` is dead code with a misleading comment + +| | | +|--|--| +| Severity | Low | +| Category | Documentation & comments | +| Status | Open | +| Location | `src/ScadaLink.TemplateEngine/Flattening/RevisionHashService.cs:136` | + +**Description** + +`SortedPropertiesConverterFactory.CanConvert` always returns `false` and +`CreateConverter` always returns `null`, so the factory registered in +`CanonicalJsonOptions` does nothing. The class comment claims it "ensures properties are +serialized in alphabetical order for deterministic output", and the options comment says +"Ensure consistent ordering" — both are false. Determinism actually relies entirely on +the `Hashable*` records being hand-declared with alphabetically-ordered properties (plus +camelCase). That works today but is fragile: a future contributor adding a property out +of alphabetical order silently changes every revision hash, and the dead converter gives +false confidence that ordering is enforced programmatically. + +**Recommendation** + +Either implement the converter to genuinely sort properties, or delete it and replace +the comments with an explicit note that determinism depends on the manual property +ordering of the `Hashable*` records (ideally enforced by a test). + +**Resolution** + +_Unresolved._ + +### TemplateEngine-012 — `DataType` enum naming diverges from the design doc + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.TemplateEngine/Validation/SemanticValidator.cs:18` | + +**Description** + +The design doc (Attribute section) lists data types as "Boolean, Integer, Float, +String". The actual `DataType` enum is `Boolean, Int32, Float, Double, DateTime, +Binary`. `SemanticValidator.NumericDataTypes` correctly hard-codes the real names +(`Int32`, `Float`, `Double`), so the code is internally consistent, but the design doc +is stale — it omits `Double`, `DateTime`, `Binary` and calls the integer type +"Integer". This makes the doc an unreliable reference for which trigger-operand types +are numeric. + +**Recommendation** + +Update `docs/requirements/Component-TemplateEngine.md` to list the actual enum members, +or rename the enum to match the doc if "Integer" is the intended canonical name. + +**Resolution** + +_Unresolved._ + +### TemplateEngine-013 — `ToDictionary(t => t.Id)` throws on duplicate IDs; cycle detectors overload Id 0 as a sentinel + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.TemplateEngine/CycleDetector.cs:30`, `src/ScadaLink.TemplateEngine/CycleDetector.cs:38` | + +**Description** + +Across the static helpers, `allTemplates.ToDictionary(t => t.Id)` is used freely; if the +caller ever passes a list containing two templates with the same `Id` (e.g. a +not-yet-saved template assigned `Id == 0`, or duplicated input) the call throws an +unhandled `ArgumentException` rather than returning a `Result` failure. Separately, +`CycleDetector` uses `0` as the "no parent" sentinel (`currentId != 0`, +`ParentTemplateId ?? 0`) and `DetectInheritanceCycle` / `DetectCrossGraphCycle` ignore a +proposed parent/composed id of `0`. EF identity keys start at 1 so this is currently +benign, but the overload is fragile — an in-memory or test template with `Id == 0` +would be treated as "no template" and cycle checks would be silently skipped. + +**Recommendation** + +Guard the dictionary builds (or use a grouping/`ToLookup`) and validate input, and use +`int?`/`-1` rather than `0` as the no-parent sentinel so a real id of 0 is never +special. + +**Resolution** + +_Unresolved._ + +### TemplateEngine-014 — Template-deletion constraint logic is duplicated and divergent + +| | | +|--|--| +| Severity | Low | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:109`, `src/ScadaLink.TemplateEngine/Services/TemplateDeletionService.cs:27` | + +**Description** + +`TemplateService.DeleteTemplateAsync` and `TemplateDeletionService.CanDeleteTemplateAsync` +both implement the "can this template be deleted" rules (instances, child templates, +derived templates, composing templates). The two implementations have already drifted: +`TemplateService` reads composing templates from the in-memory `t.Compositions` +navigation while `TemplateDeletionService` issues per-template +`GetCompositionsByTemplateIdAsync` calls (see TemplateEngine-009), they format error +messages differently, and `TemplateService` returns on the first failing category while +`TemplateDeletionService` accumulates all of them. A future rule change must be made in +two places or behaviour will diverge further. + +**Recommendation** + +Make `TemplateService.DeleteTemplateAsync` delegate to `TemplateDeletionService` (or +vice versa) so the constraint logic lives in exactly one place. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/_template/findings.md b/code-reviews/_template/findings.md new file mode 100644 index 0000000..2b56b6a --- /dev/null +++ b/code-reviews/_template/findings.md @@ -0,0 +1,67 @@ +# Code Review — + + + +| Field | Value | +|-------|-------| +| Module | `src/ScadaLink.` | +| Design doc | `docs/requirements/Component-.md` | +| Status | Not yet reviewed \| In progress \| Reviewed | +| Last reviewed | YYYY-MM-DD | +| Reviewer | | +| Commit reviewed | `` | +| Open findings | 0 | + +## Summary + +One short paragraph: overall health of the module, themes across findings, and +anything notable that is not a finding. + +## Checklist coverage + +Confirm every category was examined. Record "No issues found" where applicable. + +| # | Category | Examined | Notes | +|---|----------|----------|-------| +| 1 | Correctness & logic bugs | ☐ | | +| 2 | Akka.NET conventions | ☐ | | +| 3 | Concurrency & thread safety | ☐ | | +| 4 | Error handling & resilience | ☐ | | +| 5 | Security | ☐ | | +| 6 | Performance & resource management | ☐ | | +| 7 | Design-document adherence | ☐ | | +| 8 | Code organization & conventions | ☐ | | +| 9 | Testing coverage | ☐ | | +| 10 | Documentation & comments | ☐ | | + +## Findings + + + +### -001 — + +| | | +|--|--| +| Severity | Critical \| High \| Medium \| Low | +| Category | | +| Status | Open \| In Progress \| Resolved \| Won't Fix \| Deferred | +| Location | `src/ScadaLink./.cs:` | + +**Description** + +What is wrong and why it matters. + +**Recommendation** + +Concrete suggested fix. + +**Resolution** + +_Unresolved._ +