docs: add code review process and baseline review of all 19 modules

Establishes a per-module code review workflow under code-reviews/ and
records the 2026-05-16 baseline review (commit 9c60592): 241 findings
across all src/ modules (6 Critical, 46 High, 100 Medium, 89 Low).
This is the clean starting point for remediation work.
Joseph Doherty
2026-05-16 18:09:09 -04:00
parent 9c60592632
commit 977d7369a7
23 changed files with 8899 additions and 0 deletions

4
.gitignore vendored
@@ -32,3 +32,7 @@ TestResults/
**/logs/
site_events.db
data/
# Claude Code local files
.claude/settings.local.json
.claude/scheduled_tasks.lock

@@ -0,0 +1,442 @@
# Code Review — CLI
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.CLI` |
| Design doc | `docs/requirements/Component-CLI.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 13 |
## Summary
The CLI is a small, well-structured HTTP client over the Management API. The command-tree
construction is consistent and repetitive in a good way: every subcommand funnels through
`CommandHelpers.ExecuteCommandAsync`, which centralizes URL/credential resolution, HTTP
dispatch, and response handling. There are no Akka.NET concerns (the CLI is a pure HTTP
client) and no concurrency-sensitive code apart from the `debug stream` SignalR handler.
The dominant theme is **graceful-degradation gaps**: several user-supplied inputs (malformed
URLs, malformed `--bindings`/`--overrides` JSON, non-JSON success bodies) are deserialized
or constructed without `try/catch`, so a normal user mistake surfaces as an unhandled
exception with a stack trace instead of a clean error message and exit code 1. A second
theme is **dead configuration**: the `SCADALINK_FORMAT` environment variable and the
`defaultFormat` config-file field are loaded by `CliConfig` but never consulted by any
command, so the documented format-precedence chain does not work. The third theme is
**substantial design-document drift**: `Component-CLI.md` describes a name-keyed,
`--file`-based command surface that bears little resemblance to the implemented
ID-keyed, flag-based surface. Test coverage exercises `OutputFormatter`, `CliConfig`, and
`CommandHelpers.HandleResponse`, but the HTTP client, the `debug stream` path, the JSON
argument parsing, and the command-tree wiring are untested.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Format precedence is broken (CLI-001); empty/non-JSON success bodies crash table rendering (CLI-002, CLI-003). |
| 2 | Akka.NET conventions | ☑ | Not applicable — CLI is a pure HTTP/SignalR client with no Akka.NET runtime (design doc confirms). No issues. |
| 3 | Concurrency & thread safety | ☑ | Only `debug stream` is concurrent; `CancellationTokenSource` is never disposed (CLI-011). Exit-code resolution after Ctrl+C is loose (CLI-012). |
| 4 | Error handling & resilience | ☑ | Unhandled exceptions on malformed URL (CLI-004) and malformed JSON arguments (CLI-005); `StartAsync` cancellation is misreported (CLI-010). |
| 5 | Security | ☑ | `--password` on the command line leaks into process listings / shell history with no env-var or prompt alternative (CLI-006). |
| 6 | Performance & resource management | ☑ | `HttpClient` per invocation is acceptable for a one-shot CLI. `CancellationTokenSource` leak noted in CLI-011. |
| 7 | Design-document adherence | ☑ | `Component-CLI.md` is heavily stale relative to the implemented command surface (CLI-007). |
| 8 | Code organization & conventions | ☑ | Consistent and clean; `CliConfig.DefaultFormat` is loaded but unused (covered by CLI-001). Minor: `--format` not validated (CLI-008). |
| 9 | Testing coverage | ☑ | No tests for `ManagementHttpClient`, `DebugCommands`, command-tree wiring, or JSON argument parsing (CLI-013). |
| 10 | Documentation & comments | ☑ | `Component-CLI.md` mismatch (CLI-007); the in-repo `README.md` is reasonably accurate. Minor exit-code doc mismatch (CLI-009). |
## Findings
### CLI-001 — `SCADALINK_FORMAT` env var and config-file format are dead; format precedence broken
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/CommandHelpers.cs:18`, `src/ScadaLink.CLI/Commands/DebugCommands.cs:45`, `src/ScadaLink.CLI/CliConfig.cs:37-39` |
**Description**
`CliConfig.Load()` reads `SCADALINK_FORMAT` and the `defaultFormat` config-file field into
`CliConfig.DefaultFormat`, and `Component-CLI.md` documents a format-precedence chain
(command-line option → env var → config file). However, every command resolves the format
with `var format = result.GetValue(formatOption) ?? "json";` and `formatOption` is created
in `Program.cs:11` with `DefaultValueFactory = _ => "json"`. `GetValue` therefore always
returns a non-null value ("json" when the flag is absent), so the `?? "json"` fallback never
fires and `config.DefaultFormat` is never consulted. The env var and config-file format
settings are dead code: `scadalink site list` always outputs JSON regardless of
`SCADALINK_FORMAT=table` or a `defaultFormat` entry in `~/.scadalink/config.json`. The
documented behaviour silently does not work.
**Recommendation**
Either remove the `--format` option's `DefaultValueFactory` and have `CommandHelpers`
resolve precedence explicitly (`result.GetValue(formatOption)`, falling back to `config.DefaultFormat`),
or detect whether the option was explicitly supplied (`result.GetResult(formatOption)`) and
only then override the config value. Apply the same fix to `DebugCommands.BuildStream`.
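
A minimal sketch of the explicit precedence resolution, assuming the `--format` option no longer carries a `DefaultValueFactory` (so an absent flag yields `null`) and that `CliConfig.DefaultFormat` already holds the merged env-var/config-file value. `FormatResolver` is a hypothetical helper, not part of the reviewed code.

```csharp
using System;

// An explicit flag wins, then the configured default, then the built-in "json".
Console.WriteLine(FormatResolver.Resolve(null, "table"));    // table
Console.WriteLine(FormatResolver.Resolve("json", "table"));  // json
Console.WriteLine(FormatResolver.Resolve(null, null));       // json

// Hypothetical helper (assumption): not present in src/ScadaLink.CLI.
static class FormatResolver
{
    // cliValue:      value of --format, or null when the flag was not supplied.
    // configDefault: CliConfig.DefaultFormat (env var / config file), or null.
    public static string Resolve(string? cliValue, string? configDefault)
    {
        if (!string.IsNullOrWhiteSpace(cliValue)) return cliValue;
        if (!string.IsNullOrWhiteSpace(configDefault)) return configDefault;
        return "json";
    }
}
```

Keeping the rule in one pure function would let `CommandHelpers` and `DebugCommands.BuildStream` share it and makes the precedence chain trivially unit-testable.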
**Resolution**
_Unresolved._
### CLI-002 — Empty success body crashes table rendering with an unhandled exception
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/CommandHelpers.cs:59-68`, `src/ScadaLink.CLI/Commands/CommandHelpers.cs:78-80` |
**Description**
`ManagementHttpClient.SendCommandAsync` returns `JsonData = responseBody` for any
success status code, including a 200/204 with an empty body. `HandleResponse` then tests
`response.JsonData != null` — an empty string is non-null — and for `--format table`
calls `WriteAsTable(response.JsonData)`, which immediately does `JsonDocument.Parse(json)`.
`JsonDocument.Parse("")` throws `JsonException`, which is not caught anywhere, so a
command that legitimately returns no body (e.g. a delete that returns 204) terminates with
a stack trace instead of a clean success message.
**Recommendation**
In `HandleResponse`, treat a null-or-whitespace `JsonData` as a "command succeeded, no
output" case (print nothing or `(ok)`), and return 0 before attempting to parse.
**Resolution**
_Unresolved._
### CLI-003 — Non-JSON success body crashes table rendering
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/CommandHelpers.cs:80` |
**Description**
`WriteAsTable` calls `JsonDocument.Parse(json)` with no `try/catch`. If the server returns
a success status but a body that is not valid JSON (a proxy/HTML error page returned with
a 200, a plain-text message, etc.), the CLI throws an unhandled `JsonException`. The
error-path code in `ManagementHttpClient` (lines 52-61) already defensively wraps
`JsonDocument.Parse` in a `try/catch`; the success path and `WriteAsTable` do not get the
same treatment.
**Recommendation**
Wrap the `JsonDocument.Parse` in `WriteAsTable` in a `try/catch`; on failure, fall back to
printing the raw body verbatim (as the JSON path already does at line 66).
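
One possible shape for that fallback, sketched under assumptions: `SafeRenderer` and the `writeTable` callback are hypothetical stand-ins for `HandleResponse`/`WriteAsTable`. The same guard also covers the empty-body case from CLI-002.

```csharp
using System;
using System.IO;
using System.Text.Json;

var output = new StringWriter();
SafeRenderer.Render("<html>proxy error</html>", output, _ => output.WriteLine("(table)"));
SafeRenderer.Render("", output, _ => output.WriteLine("(table)"));
Console.Write(output.ToString());   // only the raw HTML body is printed

// Hypothetical helper (assumption): illustrates the guard, not the real renderer.
static class SafeRenderer
{
    public static void Render(string? body, TextWriter output, Action<JsonDocument> writeTable)
    {
        // Success with no body (e.g. HTTP 204): nothing to render.
        if (string.IsNullOrWhiteSpace(body)) return;

        JsonDocument doc;
        try
        {
            doc = JsonDocument.Parse(body);
        }
        catch (JsonException)
        {
            // Not JSON: fall back to printing the body verbatim.
            output.WriteLine(body);
            return;
        }

        using (doc)
        {
            writeTable(doc);
        }
    }
}
```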
**Resolution**
_Unresolved._
### CLI-004 — Malformed `--url` throws an unhandled `UriFormatException`
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CLI/ManagementHttpClient.cs:13` |
**Description**
The `ManagementHttpClient` constructor does `new Uri(baseUrl.TrimEnd('/') + "/")` with no
validation. If the user passes a malformed URL (e.g. `--url localhost:9001` without a
scheme, or `--url ""`), `new Uri(...)` throws `UriFormatException`. This call is not
guarded by the `try/catch` in `SendCommandAsync`: it runs in the constructor, which is
invoked at `CommandHelpers.cs:50`. A common typo therefore terminates the CLI with a stack
trace rather than the documented "connection failure → exit 1 with a descriptive message".
**Recommendation**
Validate the URL before constructing the client — e.g. `Uri.TryCreate(url, UriKind.Absolute, out _)` in `CommandHelpers.ExecuteCommandAsync` and `DebugCommands.BuildStream` — and emit a
clean `INVALID_URL` error with exit code 1 on failure.
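
A hedged sketch of such a check. `UrlCheck` is a hypothetical helper. Depending on the runtime, `Uri.TryCreate` may accept a scheme-less value such as `localhost:9001` by reading `localhost` as the scheme, so the sketch also requires an `http` or `https` scheme; that extra check is an assumption beyond the recommendation above.

```csharp
using System;

Console.WriteLine(UrlCheck.TryParseBaseUrl("http://localhost:9001", out _));  // True
Console.WriteLine(UrlCheck.TryParseBaseUrl("localhost:9001", out _));         // False
Console.WriteLine(UrlCheck.TryParseBaseUrl("", out _));                       // False

// Hypothetical helper (assumption): not present in src/ScadaLink.CLI.
static class UrlCheck
{
    public static bool TryParseBaseUrl(string? value, out Uri? baseUri)
    {
        baseUri = null;
        if (string.IsNullOrWhiteSpace(value)) return false;

        // Same normalisation the constructor applies, but without throwing.
        if (!Uri.TryCreate(value.TrimEnd('/') + "/", UriKind.Absolute, out var parsed))
            return false;

        // Reject anything that is not plain HTTP(S).
        if (parsed.Scheme != Uri.UriSchemeHttp && parsed.Scheme != Uri.UriSchemeHttps)
            return false;

        baseUri = parsed;
        return true;
    }
}
```

On a `false` result the caller would emit the `INVALID_URL` error and return 1 before any client is constructed.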
**Resolution**
_Unresolved._
### CLI-005 — Malformed `--bindings` / `--overrides` JSON throws unhandled exceptions
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/InstanceCommands.cs:55-58`, `src/ScadaLink.CLI/Commands/InstanceCommands.cs:181-182` |
**Description**
`set-bindings` deserializes the `--bindings` argument with
`JsonSerializer.Deserialize<List<List<JsonElement>>>(...)` and then indexes `p[0]`/`p[1]`
and calls `p[0].GetString()!` / `p[1].GetInt32()`. `set-overrides` deserializes `--overrides`
with `JsonSerializer.Deserialize<Dictionary<string, string?>>(...)`. None of this is wrapped
in a `try/catch`. Invalid JSON throws `JsonException`; a pair with fewer than two elements
throws `ArgumentOutOfRangeException`; a non-string/non-int element throws `InvalidOperationException`. All of these surface as raw stack traces, so a user typo in a JSON argument
crashes the CLI instead of producing a clean validation error and exit code 1.
**Recommendation**
Wrap the parsing in `try/catch (JsonException ...)` (and guard the pair length / element
kinds), and on failure call `OutputFormatter.WriteError(...)` with an `INVALID_ARGUMENT`
code and return 1.
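
A sketch of guarded parsing for `--bindings`, assuming each pair is `[string, int]` as the existing `GetString()`/`GetInt32()` calls imply. `BindingsParser`, the tuple element names, and the error wording are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

Console.WriteLine(BindingsParser.TryParse("[[\"Temp\", 3]]", out var parsed, out _));  // True
Console.WriteLine(parsed.Count);                                                       // 1
Console.WriteLine(BindingsParser.TryParse("[[\"Temp\"]]", out _, out var why));        // False
Console.WriteLine(why);

// Hypothetical helper (assumption): extracted so the parsing can be unit-tested.
static class BindingsParser
{
    public static bool TryParse(string json, out List<(string Key, int Value)> bindings, out string? error)
    {
        bindings = new List<(string Key, int Value)>();
        error = null;

        List<List<JsonElement>>? pairs;
        try
        {
            pairs = JsonSerializer.Deserialize<List<List<JsonElement>>>(json);
        }
        catch (JsonException ex)
        {
            error = $"--bindings is not valid JSON: {ex.Message}";
            return false;
        }

        if (pairs is null)
        {
            error = "--bindings must be a JSON array of [string, int] pairs.";
            return false;
        }

        foreach (var p in pairs)
        {
            // Check length and element kinds before touching p[0]/p[1].
            if (p is null || p.Count != 2
                || p[0].ValueKind != JsonValueKind.String
                || p[1].ValueKind != JsonValueKind.Number
                || !p[1].TryGetInt32(out var value))
            {
                error = "Each --bindings entry must be a [string, int] pair.";
                return false;
            }
            bindings.Add((p[0].GetString()!, value));
        }
        return true;
    }
}
```

On failure the command handler would pass `error` to `OutputFormatter.WriteError(...)` and return 1.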
**Resolution**
_Unresolved._
### CLI-006 — Password is passed as a command-line argument with no safer alternative
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.CLI/Program.cs:9`, `src/ScadaLink.CLI/Commands/CommandHelpers.cs:36-44` |
**Description**
Credentials are supplied only via `--username` / `--password`. A password on the command
line is visible to any local user via the process list (`ps`, `/proc/<pid>/cmdline`) and is
typically persisted into shell history. Unlike the management URL — which can also come
from `SCADALINK_MANAGEMENT_URL` or the config file — there is no environment-variable
fallback, no `--password-stdin`, and no interactive prompt for the password. For a tool
explicitly intended for CI/CD automation this materially increases the chance of credential
leakage.
**Recommendation**
Add a `SCADALINK_PASSWORD` environment variable fallback and/or a `--password-stdin`
option (read the password from stdin), and document that `--password` on the command line
is discouraged. Optionally prompt interactively when stdin is a TTY and no password was
supplied.
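
A rough sketch of how the password could be resolved. `SCADALINK_PASSWORD` and `--password-stdin` are the names proposed above and do not exist today; `PasswordSource` and the precedence order (stdin, then flag, then env var) are assumptions.

```csharp
using System;
using System.IO;

// Reads the password from a supplied reader instead of the command line.
var fromStdin = PasswordSource.Resolve(null, true, new StringReader("s3cret\n"));
Console.WriteLine(fromStdin == "s3cret");   // True

// Hypothetical helper (assumption): not present in src/ScadaLink.CLI.
static class PasswordSource
{
    // flagValue:     value of --password, or null when absent.
    // readFromStdin: true when the proposed --password-stdin flag is set.
    // Returns null when no source supplied a password.
    public static string? Resolve(string? flagValue, bool readFromStdin, TextReader stdin)
    {
        if (readFromStdin) return stdin.ReadLine();
        if (!string.IsNullOrEmpty(flagValue)) return flagValue;

        var fromEnv = Environment.GetEnvironmentVariable("SCADALINK_PASSWORD");
        return string.IsNullOrEmpty(fromEnv) ? null : fromEnv;
    }
}
```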
**Resolution**
_Unresolved._
### CLI-007 — `Component-CLI.md` command surface is substantially stale
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `docs/requirements/Component-CLI.md:51-211` (vs. all files under `src/ScadaLink.CLI/Commands/`) |
**Description**
The "Command Structure" section of the design doc no longer matches the implemented CLI.
Examples of the drift:
- The doc keys most operations by **name** (`template get <name>`, `instance get <code>`,
`site get <site-id>`); the implementation keys everything by integer **ID** via `--id`
(`TemplateCommands.cs:40`, `InstanceCommands.cs:31`, `SiteCommands.cs:26`).
- The doc shows `template create ... --file <path>` and `site update <site-id> --file <path>`;
the implementation has no `--file` option anywhere and instead takes individual flags
(`TemplateCommands.cs:52-72`, `SiteCommands.cs:83-115`).
- The doc lists commands that do not exist (`template diff`, `instance bind-connections`,
`instance assign-area`, `template attribute add --tag-path`, `data-connection assign/unassign`,
`security api-key enable/disable` as separate commands) and omits commands that do exist
(`instance alarm-override set/delete/list`, `external-system method` subgroup).
- The doc's `notification smtp update --file` differs from the implemented
`--server/--port/--auth-mode/--from-address` flags (`NotificationCommands.cs:72-94`).
- The doc uses `--site` for site identification in several places where the implementation
uses `--site-id` or `--identifier`.
A reader following the design doc would be unable to drive the CLI.
**Recommendation**
Regenerate the "Command Structure" section of `Component-CLI.md` from the actual command
tree (the in-repo `src/ScadaLink.CLI/README.md` is much closer to reality and could be the
source), or mark the doc's command list as illustrative and point to the README as
authoritative.
**Resolution**
_Unresolved._
### CLI-008 — `--format` value is not validated
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.CLI/Program.cs:10-11`, `src/ScadaLink.CLI/Commands/CommandHelpers.cs:60` |
**Description**
The `--format` option accepts any string. `HandleResponse` only checks
`string.Equals(format, "table", ...)`; any other value — including a typo like
`--format tabel` or `--format xml` — silently falls through to JSON output. The user gets
no feedback that their requested format was not honoured.
**Recommendation**
Restrict the option to the accepted values, e.g. `formatOption.AcceptOnlyFromAmong("json", "table")`, so `System.CommandLine` rejects invalid input with a clear parse error.
**Resolution**
_Unresolved._
### CLI-009 — Exit-code documentation does not match `HandleResponse` behaviour
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `docs/requirements/Component-CLI.md:238-249`, `src/ScadaLink.CLI/Commands/CommandHelpers.cs:75` |
**Description**
The design doc's Exit Codes table defines code 2 as "Authorization failure (insufficient
role)" and the Error Handling section says "If the server returns HTTP 403, the CLI exits
with code 2." `HandleResponse` implements `return response.StatusCode == 403 ? 2 : 1;`,
which is correct for the HTTP error path. However, the `NO_URL`, `NO_CREDENTIALS`,
`INVALID_OPERATION` (from `set-bindings`/`set-overrides`) and any other client-side failure
all return 1, and a connection failure carries `StatusCode == 0` — none of which the doc
enumerates. More importantly, an authorization failure that the server signals with a body
`code` of `UNAUTHORIZED` but an HTTP status other than 403 would be classified as a generic
error (exit 1). The mapping is purely status-driven and the doc does not state that.
**Recommendation**
Either document precisely that exit code 2 is determined solely by HTTP 403, or key the
"authorization failure" exit code off the response `code` field as well. Align the doc
with whichever is chosen.
**Resolution**
_Unresolved._
### CLI-010 — `debug stream` reports Ctrl+C during connect as a connection failure
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/DebugCommands.cs:181-189` |
**Description**
`StreamDebugAsync` calls `await connection.StartAsync(cts.Token)` inside a
`try { } catch (Exception ex)` that unconditionally reports
`"Connection failed: {ex.Message}"` with code `CONNECTION_FAILED` and returns 1. If the
user presses Ctrl+C while the connection is still being established, `cts` is cancelled and
`StartAsync` throws `OperationCanceledException`; this is caught by the generic handler and
misreported as a connection failure (with exit code 1) rather than a clean user-initiated
cancellation (exit code 0).
**Recommendation**
Catch `OperationCanceledException` separately (return 0 quietly) before the generic
`catch (Exception)` handler, mirroring how the `exitTcs.Task.WaitAsync(cts.Token)` path at
lines 209-215 already treats cancellation as graceful.
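
The handler ordering can be sketched as follows. `ConnectStep` is hypothetical, the `start` delegate stands in for `connection.StartAsync`, and writing to stderr stands in for the real `CONNECTION_FAILED` error output.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

using var cts = new CancellationTokenSource();
cts.Cancel();   // simulate Ctrl+C arriving while still connecting

var code = await ConnectStep.RunAsync(t => Task.Delay(1000, t), cts.Token);
Console.WriteLine(code);   // 0

// Hypothetical helper (assumption): isolates the connect step for illustration.
static class ConnectStep
{
    // Returns an exit code when the caller should stop, or null to continue streaming.
    public static async Task<int?> RunAsync(Func<CancellationToken, Task> start, CancellationToken token)
    {
        try
        {
            await start(token);
            return null;
        }
        catch (OperationCanceledException) when (token.IsCancellationRequested)
        {
            // User-initiated cancellation: exit quietly.
            return 0;
        }
        catch (Exception ex)
        {
            Console.Error.WriteLine($"Connection failed: {ex.Message}");
            return 1;
        }
    }
}
```

The `when` filter keeps cancellations that did not originate from the user's token on the failure path.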
**Resolution**
_Unresolved._
### CLI-011 — `CancellationTokenSource` in `debug stream` is never disposed
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/DebugCommands.cs:89` |
**Description**
`var cts = new CancellationTokenSource();` is created in `StreamDebugAsync` but never
disposed; there is no `using` declaration and no explicit `Dispose()` call on any exit
path. `CancellationTokenSource` owns a `WaitHandle` and should be disposed. The impact is
small because the process exits shortly after, but it is an `IDisposable` left undisposed,
contrary to the review checklist's resource-management expectation.
**Recommendation**
Declare it as `using var cts = new CancellationTokenSource();` (or wrap the method body in
a `try/finally`).
**Resolution**
_Unresolved._
### CLI-012 — `debug stream` exit code is unreliable after stream termination
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.CLI/Commands/DebugCommands.cs:208-227` |
**Description**
After `await exitTcs.Task.WaitAsync(cts.Token)`, the method returns
`exitTcs.Task.IsCompletedSuccessfully ? exitTcs.Task.Result : 0`. When the user cancels
with Ctrl+C, `WaitAsync` throws `OperationCanceledException` and `exitTcs` is typically
still incomplete, so the method returns 0 — correct. However, the `OnStreamTerminated`
handler and the `Closed` handler both call `exitTcs.TrySetResult`, and these run on
SignalR callback threads concurrently with the Ctrl+C path. If a stream termination and a
Ctrl+C race, the final exit code depends on which `TrySetResult` won and whether
`WaitAsync` observed completion before cancellation — the result is not deterministic. A
stream the server terminated abnormally can end up returning 0.
**Recommendation**
Resolve the exit code from a single authoritative source: after the `try/catch` around
`WaitAsync`, check `exitTcs.Task` completion explicitly and treat a Ctrl+C with no prior
result as 0, but always prefer a result that was set by `OnStreamTerminated`/`Closed`.
Consider awaiting `exitTcs.Task` without the cancellation token after a brief grace period.
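
One way to express "prefer a recorded result, with a grace period" is sketched below. `ExitCodeResolver` and the 50 ms grace value are assumptions; a real implementation would pick a period appropriate to the SignalR callbacks.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

using var cts = new CancellationTokenSource();
cts.Cancel();   // simulate Ctrl+C

var terminated = new TaskCompletionSource<int>();
terminated.TrySetResult(1);   // a termination handler already recorded a failure
Console.WriteLine(await ExitCodeResolver.ResolveAsync(terminated, cts.Token, TimeSpan.FromMilliseconds(50)));  // 1

var idle = new TaskCompletionSource<int>();   // no handler ever fired
Console.WriteLine(await ExitCodeResolver.ResolveAsync(idle, cts.Token, TimeSpan.FromMilliseconds(50)));        // 0

// Hypothetical helper (assumption): not present in DebugCommands.cs.
static class ExitCodeResolver
{
    public static async Task<int> ResolveAsync(TaskCompletionSource<int> exitTcs, CancellationToken token, TimeSpan grace)
    {
        try
        {
            return await exitTcs.Task.WaitAsync(token);
        }
        catch (OperationCanceledException)
        {
            // Give a racing termination handler a short window to record its result.
            await Task.WhenAny(exitTcs.Task, Task.Delay(grace));
            return exitTcs.Task.IsCompletedSuccessfully ? exitTcs.Task.Result : 0;
        }
    }
}
```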
**Resolution**
_Unresolved._
### CLI-013 — HTTP client, `debug stream`, and JSON-argument parsing are untested
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.CLI.Tests/` (vs. `src/ScadaLink.CLI/ManagementHttpClient.cs`, `src/ScadaLink.CLI/Commands/DebugCommands.cs`, `src/ScadaLink.CLI/Commands/InstanceCommands.cs:55-58`) |
**Description**
The test project covers `OutputFormatter`, `CliConfig.Load`, and
`CommandHelpers.HandleResponse`. It does not cover:
- `ManagementHttpClient.SendCommandAsync` — the timeout (504), connection-failure (code 0),
and error-body-parsing paths are untested.
- The `debug stream` SignalR command — no tests at all.
- The JSON-argument parsing in `InstanceCommands` (`set-bindings`, `set-overrides`) — the
paths most likely to crash on bad input (CLI-005) have no coverage.
- Command-tree wiring — there is no test asserting that each `Build` produces the expected
subcommands/options or that the command-name derivation
(`ManagementCommandRegistry.GetCommandName`) resolves for every command type the CLI
constructs.
**Recommendation**
Add tests for `ManagementHttpClient` (using a stub `HttpMessageHandler`), for the
JSON-argument parsing helpers (extracting the parsing into testable methods), and a
smoke test that walks the root command tree and asserts every leaf command's payload type
resolves via `ManagementCommandRegistry`.
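
A stub handler of the kind suggested might look like this. It assumes `ManagementHttpClient` can be given an `HttpMessageHandler` (or an `HttpClient`) for tests, which the review does not confirm; the `management` path and response body are placeholders.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Canned 403 response; no network access is involved.
var handler = new StubHandler(_ => new HttpResponseMessage(HttpStatusCode.Forbidden)
{
    Content = new StringContent("{\"code\":\"UNAUTHORIZED\"}")
});
using var client = new HttpClient(handler) { BaseAddress = new Uri("http://localhost/") };

var response = await client.PostAsync("management", new StringContent("{}"));
Console.WriteLine((int)response.StatusCode);   // 403

// Test double that answers every request from a delegate.
sealed class StubHandler : HttpMessageHandler
{
    private readonly Func<HttpRequestMessage, HttpResponseMessage> _respond;

    public StubHandler(Func<HttpRequestMessage, HttpResponseMessage> respond) => _respond = respond;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
        => Task.FromResult(_respond(request));
}
```

Having the delegate throw `HttpRequestException` or `TaskCanceledException` instead would exercise the connection-failure and timeout paths.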
**Resolution**
_Unresolved._

@@ -0,0 +1,633 @@
# Code Review — CentralUI
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.CentralUI` |
| Design doc | `docs/requirements/Component-CentralUI.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 19 |
## Summary
The Central UI is a sizeable, generally well-structured Blazor Server module:
custom Bootstrap components only (no third-party UI frameworks, as required),
consistent list/form page patterns, careful disposal in most components, and a
thoughtful Roslyn-backed script editor. The most serious problem is the
**Test Run sandbox** (`ScriptAnalysisService.RunInSandboxAsync`): it compiles
and executes arbitrary user C# *in the central process* with no enforcement of
the documented script trust model — the forbidden-API list is only a Monaco
editor diagnostic, never applied before execution — so a Design user can run
`System.IO`/`Process`/`Reflection` code on the central node. Several other
themes recur: (1) per-circuit security drift — site-scoped Deployment claims
are written at login but never read, so site scoping is not enforced anywhere;
(2) Blazor render-thread and disposal hazards — background `Timer` / `Task.Delay`
callbacks and stream callbacks touch component state and `@ref` children that
may already be disposed; (3) process-global mutation (`Console.SetOut`) shared
across concurrent circuits; (4) drift from the design doc on session expiry and
on the "deployment status pushes via SignalR" claim (the page actually polls).
Testing coverage is thin for a module this large: only the script analyzer,
TreeView, schema model, and a few data-connection pages have unit tests; most
pages and the auth bridge are untested.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | DebugView cap logic, audit-log timezone, toast race — see findings. |
| 2 | Akka.NET conventions | ☑ | Module is mostly UI; `DebugStreamService` actor usage reviewed (in Communication but driven from here). No actor-convention violations in CentralUI proper. |
| 3 | Concurrency & thread safety | ☑ | `Console.SetOut` global mutation, stream/timer callbacks on non-render threads, toast `_ = Task.Delay`. |
| 4 | Error handling & resilience | ☑ | Broad `catch {}` swallowing, dangling `TaskCompletionSource` on dialog disposal. |
| 5 | Security | ☑ | Sandbox not enforcing trust model (Critical); site scoping never enforced; auth bridge reads stale HttpContext; logout CSRF. |
| 6 | Performance & resource management | ☑ | N+1 site-connection query, repeated `FilteredMessages` recomputation, full-page paginators rendering all page buttons. |
| 7 | Design-document adherence | ☑ | Session expiry diverges from "15-min sliding + 30-min idle"; Deployments polls despite "push via SignalR"; nav exposes Deployment-only pages to all roles. |
| 8 | Code organization & conventions | ☑ | Generally good; options classes absent (no appsettings binding here); no major violations. |
| 9 | Testing coverage | ☑ | Auth, sandbox-run, DebugView, Health, ParkedMessages, most pages untested. |
| 10 | Documentation & comments | ☑ | Comments are accurate and helpful; a few stale claims noted. |
## Findings
### CentralUI-001 — Test Run sandbox executes arbitrary C# with no trust-model enforcement
| | |
|--|--|
| Severity | Critical |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/ScriptAnalysis/ScriptAnalysisService.cs:171-424` |
**Description**
`RunInSandboxAsync` compiles user-supplied script code with `CSharpScript.Create`
and executes it (`script.RunAsync`) directly inside the central process. The
"sandbox" applies only a wall-clock timeout and an output-size cap. It does
**not** enforce the documented script trust model: the forbidden-API set
(`System.IO`, `System.Diagnostics`/`Process`, `System.Reflection`, `System.Net`,
threading) is checked only in `FindForbiddenApiUsages`, which feeds Monaco
editor diagnostics — it is never consulted before `RunInSandboxAsync` executes.
`DefaultOptions` references `typeof(object).Assembly` (the full BCL), so a
Design-role user can submit `System.IO.File.WriteAllText(...)`,
`System.Diagnostics.Process.Start(...)`, reflection, or raw socket code via
`POST /api/script-analysis/run` and it runs with the central host process's
full privileges. The endpoint is gated only by `RequireDesign`. This is a
remote code execution path on the central cluster node.
**Recommendation**
Before executing, run the same forbidden-API analysis used for diagnostics and
reject any script with a `SCADA001`/`SCADA002` (severity-8) marker; additionally
restrict the compilation's metadata references to the curated script API
surface, and ideally execute in an isolated `AssemblyLoadContext`/process with
constrained permissions. Treat the trust model as an execution-time gate, not
an editor hint.
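
The execution-time gate could be as small as the sketch below. The `Marker` record is a hypothetical stand-in for whatever `FindForbiddenApiUsages` returns; only the `SCADA001`/`SCADA002` codes and the severity value 8 come from this review. The gate does not replace reference restriction or process isolation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var markers = new[]
{
    new Marker("SCADA001", 8, "System.IO is not permitted in scripts"),
    new Marker("CS0168", 4, "Variable declared but never used"),
};

var blocking = SandboxGate.BlockingMarkers(markers);
Console.WriteLine(blocking.Count == 0 ? "run" : $"rejected: {blocking[0].Code}");  // rejected: SCADA001

// Hypothetical shape of an analysis result (assumption).
record Marker(string Code, int Severity, string Message);

// Hypothetical gate (assumption): would be called before script.RunAsync.
static class SandboxGate
{
    private const int ErrorSeverity = 8;

    public static IReadOnlyList<Marker> BlockingMarkers(IEnumerable<Marker> markers) =>
        markers
            .Where(m => m.Severity >= ErrorSeverity
                        && (m.Code == "SCADA001" || m.Code == "SCADA002"))
            .ToList();
}
```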
**Resolution**
_Unresolved._
### CentralUI-002 — Site-scoped Deployment permissions are issued but never enforced
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Auth/AuthEndpoints.cs:63-69`; `src/ScadaLink.CentralUI/Components/Pages/Deployment/*.razor` |
**Description**
Login adds `SiteId` claims (`JwtTokenService.SiteIdClaimType`) for non-system-wide
Deployment users, and the design doc (Component-CentralUI "Responsibilities" and
CLAUDE.md Security & Auth) requires the Deployment role to be site-scoped. A
repo-wide search shows the `SiteId` claim is written at login and **never read
anywhere in CentralUI**. Deployment pages — `DebugView.razor`, `Deployments.razor`,
`InstanceCreate.razor`, `InstanceConfigure.razor`, `Topology.razor`,
`ParkedMessages.razor`, `EventLogs.razor` — list and act on every site with no
filtering by the user's permitted sites. A Deployment user scoped to one site
can deploy to, debug, and manage instances at any site.
**Recommendation**
Enforce site scoping: filter site/instance lists by the user's `SiteId` claims
(or treat the absence of `SiteId` claims as system-wide), and re-check the claim
server-side before any mutating cross-site command (deploy, enable/disable/delete,
debug stream, parked-message retry/discard). A shared helper that reads the
claims from `AuthenticationStateProvider` and exposes "permitted site ids" would
keep this consistent.
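
A sketch of such a helper, assuming site ids are integers and that the claim type is the string `"SiteId"` (a placeholder for `JwtTokenService.SiteIdClaimType`). `SiteScope` is hypothetical, and treating "no `SiteId` claims" as system-wide is only safe after the Deployment role itself has been checked.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Claims;

var user = new ClaimsPrincipal(new ClaimsIdentity(
    new[] { new Claim(SiteScope.SiteIdClaimType, "7") }, "test"));

Console.WriteLine(SiteScope.CanAccess(user, 7));   // True
Console.WriteLine(SiteScope.CanAccess(user, 8));   // False

// Hypothetical helper (assumption): not present in src/ScadaLink.CentralUI.
static class SiteScope
{
    public const string SiteIdClaimType = "SiteId";   // placeholder value

    // Returns null when the user carries no SiteId claims (system-wide access).
    public static HashSet<int>? PermittedSiteIds(ClaimsPrincipal user)
    {
        var ids = new HashSet<int>();
        foreach (var claim in user.FindAll(SiteIdClaimType))
        {
            if (int.TryParse(claim.Value, out var id)) ids.Add(id);
        }
        return ids.Count == 0 ? null : ids;
    }

    public static bool CanAccess(ClaimsPrincipal user, int siteId)
    {
        var permitted = PermittedSiteIds(user);
        return permitted is null || permitted.Contains(siteId);
    }
}
```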
**Resolution**
_Unresolved._
### CentralUI-003 — `Console.SetOut`/`SetError` mutates process-global state across concurrent circuits
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/ScriptAnalysis/ScriptAnalysisService.cs:359-423` |
**Description**
`RunInSandboxAsync` redirects `Console.Out`/`Console.Error` to a per-call
`StringWriter`, runs the script, then restores them in `finally`. `Console.Out`
is process-global. If two users (two Blazor circuits) run Test Run concurrently,
their captured outputs interleave or cross over, and the `finally` of whichever
finishes first restores `Console.Out` to the *original* writer while the other
run is still executing — so the second run's script output is lost or written
to the real console. `RunInSandboxAsync` is `async` and the script runs on a
thread-pool thread, so concurrent execution is fully expected.
**Recommendation**
Do not redirect process-global `Console`. Provide console capture through the
script globals surface (e.g. a `TextWriter` exposed on `SandboxScriptHost` that
the sandbox API writes to), or serialize Test Run executions with a semaphore if
global redirection must be kept. Capturing per-call without global mutation is
the correct fix.
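
Per-call capture without touching `Console` can be illustrated as follows. `SandboxOutput` is a hypothetical stand-in for a writer exposed on `SandboxScriptHost`, and the `Task.Run` body stands in for script execution.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Each run owns its writer, so concurrent runs cannot cross over.
static async Task<string> RunAsync(string name)
{
    var host = new SandboxOutput();
    await Task.Run(() => host.Out.WriteLine($"hello from {name}"));
    return host.Captured;
}

var results = await Task.WhenAll(RunAsync("A"), RunAsync("B"));
Console.Write(results[0]);   // hello from A
Console.Write(results[1]);   // hello from B

// Hypothetical per-run output buffer (assumption).
sealed class SandboxOutput
{
    private readonly StringWriter _buffer = new StringWriter();

    public TextWriter Out => _buffer;
    public string Captured => _buffer.ToString();
}
```

Scripts would then write through the host's writer rather than `Console.WriteLine`, which changes the script-facing API and would need to be documented.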
**Resolution**
_Unresolved._
### CentralUI-004 — `CookieAuthenticationStateProvider` reads `HttpContext` for the life of the circuit
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Auth/CookieAuthenticationStateProvider.cs:22-28` |
**Description**
`GetAuthenticationStateAsync` returns `_httpContextAccessor.HttpContext?.User`.
In Blazor Server, `HttpContext` is only valid during the initial HTTP request
that establishes the circuit; for the lifetime of the long-lived SignalR circuit
`IHttpContextAccessor.HttpContext` is `null` (or, worse, a stale/foreign context
if the accessor's `AsyncLocal` leaks). Any later call to
`GetAuthenticationStateAsync` — e.g. an `<AuthorizeView>` re-evaluating, or pages
that call it directly (`Sites.razor`, `Templates.razor`) — then sees an
unauthenticated principal and may render the wrong UI, or returns a stale
identity that never reflects role changes. The class derives from
`ServerAuthenticationStateProvider`, which is designed to be seeded once via
`SetAuthenticationState`; overriding `GetAuthenticationStateAsync` to read
`HttpContext` defeats that design.
**Recommendation**
Capture the authenticated principal once when the circuit is created (e.g. via
the root component / `AuthenticationStateProvider` seeding pattern used by the
Blazor Web App template) and store it on the scoped provider, instead of reading
`IHttpContextAccessor` on every call. Do not depend on `HttpContext` after the
circuit is established.
**Resolution**
_Unresolved._
### CentralUI-005 — Session expiry implementation diverges from the documented policy
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Auth/AuthEndpoints.cs:47-81`; `src/ScadaLink.CentralUI/Components/Shared/SessionExpiry.razor:18-30` |
**Description**
CLAUDE.md (Security & Auth) specifies "15-minute expiry with sliding refresh,
30-minute idle timeout." `AuthEndpoints` instead sets a single fixed
`expires_at = UtcNow + 30 minutes` claim and a 30-minute cookie `ExpiresUtc`,
with no sliding refresh and no separate idle vs absolute timeout.
`SessionExpiry.razor` schedules a single hard redirect at that fixed time. The
result is a hard 30-minute cap with no sliding renewal — an active user is
logged out mid-session, and there is no 15-minute component at all.
**Recommendation**
Either implement the documented policy (sliding 15-minute token with refresh on
activity, plus a 30-minute idle cutoff) or update the design docs to match the
fixed 30-minute model. The code and the documented decision must agree.
**Resolution**
_Unresolved._
### CentralUI-006 — Deployment status page polls every 10s despite the documented SignalR-push design
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Pages/Deployment/Deployments.razor:196-216` |
**Description**
Component-CentralUI "Real-Time Updates" states: "Deployment status:
Pending/in-progress/success/failed transitions push to the UI immediately via
SignalR (built into Blazor Server). No polling required for deployment
tracking." `Deployments.razor` instead runs a `Timer` that reloads all
deployment records and instance names from the database every 10 seconds. This
is a full N-record + instance-map reload per tick for every open circuit, and
contradicts the design. It also re-issues two repository round-trips on each
tick regardless of whether anything changed.
**Recommendation**
Implement push-based updates (an injected event/observable raised by the
Deployment Manager that the page subscribes to and renders via
`InvokeAsync(StateHasChanged)`), or amend the design doc to acknowledge polling.
If polling is kept as a fallback, fetch only changed/in-progress records.
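
The push model can be sketched with a plain event. `DeploymentStatusNotifier`, the integer deployment id, and the string status are all assumptions; in the page, the handler would call `InvokeAsync(StateHasChanged)` and be unsubscribed in `Dispose`.

```csharp
using System;

var notifier = new DeploymentStatusNotifier();
Action<int, string> handler = (id, status) => Console.WriteLine($"deployment {id}: {status}");

notifier.StatusChanged += handler;      // page subscribes on initialisation
notifier.Publish(42, "InProgress");     // prints "deployment 42: InProgress"
notifier.StatusChanged -= handler;      // page unsubscribes on disposal
notifier.Publish(42, "Success");        // prints nothing

// Hypothetical singleton service raised by the Deployment Manager (assumption).
sealed class DeploymentStatusNotifier
{
    public event Action<int, string>? StatusChanged;

    public void Publish(int deploymentId, string status) =>
        StatusChanged?.Invoke(deploymentId, status);
}
```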
**Resolution**
_Unresolved._
### CentralUI-007 — Monitoring nav links to Deployment-only pages are shown to all roles
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Layout/NavMenu.razor:69-78`; `src/ScadaLink.CentralUI/Components/Pages/Monitoring/EventLogs.razor:2`; `src/ScadaLink.CentralUI/Components/Pages/Monitoring/ParkedMessages.razor:2` |
**Description**
`NavMenu` renders the "Event Logs" and "Parked Messages" links inside the
all-authenticated-users Monitoring section. The design doc classifies both the
Site Event Log Viewer and Parked Message Management as **Deployment Role**.
Two inconsistencies result: (a) an Admin- or Design-only user sees nav links
they cannot use; (b) the pages themselves are annotated only `[Authorize]`
(any authenticated user), not `[Authorize(Policy = RequireDeployment)]`, so a
non-Deployment user who follows the link is *not* blocked — they can query site
event logs and retry/discard parked messages. The authorization attribute and
the nav visibility both contradict the design.
**Recommendation**
Add `[Authorize(Policy = AuthorizationPolicies.RequireDeployment)]` to
`EventLogs.razor` and `ParkedMessages.razor`, and move their nav links into a
`<AuthorizeView Policy="RequireDeployment">` block (consistent with the Topology
/ Deployments / Debug View links). The Health Dashboard link needs no change: it is
intentionally visible to all roles, per the design.
**Resolution**
_Unresolved._
### CentralUI-008 — Audit-log date filters treat browser-local datetimes as UTC
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Pages/Monitoring/AuditLog.razor:242-243` |
**Description**
The `From`/`To` filters bind `<input type="datetime-local">` to `DateTime?`
fields. A `datetime-local` input yields the value the user typed in their
*browser-local* time zone. `FetchPage` converts them with
`new DateTimeOffset(_filterFrom.Value, TimeSpan.Zero)` — i.e. it labels the
local wall-clock value as UTC. For any non-UTC user the audit query window is
shifted by their UTC offset, silently returning the wrong rows. CLAUDE.md
mandates UTC throughout, but that requires converting the local input *to* UTC,
not relabelling it.
**Recommendation**
Convert the picked local time to UTC before querying — capture the browser
offset (JS interop) and apply it, or document the inputs as UTC and label them
in the UI. The same issue should be checked in `EventLogs.razor` if it has
time-range filters.
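
A minimal sketch of the first option, assuming the browser's offset has already been read through JS interop (for example from JavaScript's `Date.getTimezoneOffset()`, which reports minutes behind UTC). `ToUtc` is a hypothetical helper, not code from the module:

```csharp
using System;

// Hypothetical helper: treats the DateTime bound from a datetime-local input as
// wall-clock time at the browser's offset and returns the matching UTC instant.
static DateTimeOffset ToUtc(DateTime localWallClock, int jsTimezoneOffsetMinutes)
{
    // getTimezoneOffset() is positive when local time is behind UTC,
    // so the .NET offset is its negation.
    var offset = TimeSpan.FromMinutes(-jsTimezoneOffsetMinutes);

    // DateTimeOffset only accepts an arbitrary offset for Unspecified-kind values.
    var unspecified = DateTime.SpecifyKind(localWallClock, DateTimeKind.Unspecified);
    return new DateTimeOffset(unspecified, offset).ToUniversalTime();
}

var picked = new DateTime(2026, 5, 16, 9, 0, 0);   // user typed 09:00 local
var utc = ToUtc(picked, 240);                       // browser is at UTC-4
Console.WriteLine(utc.ToString("u"));
```

A single offset captured at page load can be an hour off for dates on the other side of a daylight-saving change, so capturing a time-zone identifier instead may be worth considering if that matters for audit queries.
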
**Resolution**
_Unresolved._
### CentralUI-009 — `DebugView` stream callbacks touch a possibly-disposed `ToastNotification`
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Pages/Deployment/DebugView.razor:400-409,538-544` |
**Description**
The `onTerminated` callback passed to `DebugStreamService.StartStreamAsync`
captures `_toast` and `this` and runs on an Akka/gRPC thread. If the user
navigates away, `Dispose()` calls `StopStream`, but a stream-termination event
already in flight can still invoke `onTerminated`, which calls
`_toast.ShowError(...)` and `StateHasChanged()` on a disposed component. The
component does not guard callbacks with a disposed flag or a
`CancellationTokenSource`. The same applies to the `onEvent` callbacks at
lines 391-398 that call `InvokeAsync(StateHasChanged)`.
**Recommendation**
Track a `_disposed` flag or a `CancellationTokenSource` on the component, check it at the
top of every stream callback, and stop the stream synchronously before marking
disposed. `InvokeAsync` after disposal throws `ObjectDisposedException`; the
callbacks should no-op once disposed.
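
One possible shape for the guard, shown outside Blazor so it runs on its own. The names `OnStreamEvent`, `handled`, and `disposed` are illustrative only; in the component they would be the stream callback and a private field:

```csharp
using System;
using System.Threading;

int disposed = 0;   // 0 = live, 1 = disposed; would be a private field on the component
int handled = 0;    // stands in for the UI work done per stream event

// Hypothetical stream callback: does nothing once disposal has started.
void OnStreamEvent()
{
    if (Volatile.Read(ref disposed) == 1) return;
    handled++;      // real code: InvokeAsync(StateHasChanged), _toast.ShowError(...)
}

// Hypothetical Dispose: flips the flag exactly once, then stops the stream.
void Dispose()
{
    if (Interlocked.Exchange(ref disposed, 1) == 1) return;
    // real code: StopStream()
}

OnStreamEvent();    // delivered while the page is open
Dispose();          // user navigates away
OnStreamEvent();    // late event already in flight: ignored
Console.WriteLine($"handled = {handled}");
```

The flag narrows the window rather than closing it: a callback that has already passed the check can still race with disposal, so repeating the check inside the `InvokeAsync` delegate is a reasonable refinement.
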
**Resolution**
_Unresolved._
### CentralUI-010 — `ToastNotification` auto-dismiss continuation runs after component disposal
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Shared/ToastNotification.razor:62-71,90` |
**Description**
`AddToast` schedules `Task.Delay(dismissMs).ContinueWith(...)` with the result
discarded (`_ =`). The continuation calls `InvokeAsync(StateHasChanged)`. If the
host page is disposed before the 5-second delay elapses (common — navigate away
right after an action), the continuation runs against a disposed component and
`InvokeAsync` throws `ObjectDisposedException` on a thread-pool thread with no
catch, producing an unobserved task exception. `Dispose()` has an empty body and
cancels nothing.
**Recommendation**
Hold a `CancellationTokenSource`, pass its token to `Task.Delay`, cancel it in
`Dispose()`, and guard the continuation. Alternatively wrap the continuation
body in a try/catch for `ObjectDisposedException`.
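
The first option could look roughly like the following, written as a stand-alone program rather than a Razor component. `DismissAfterAsync` and `dismissed` are hypothetical stand-ins for the auto-dismiss logic in `AddToast`:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

var cts = new CancellationTokenSource();   // would be a field on the component
var dismissed = false;

// Hypothetical replacement for Task.Delay(...).ContinueWith(...):
// the delay observes the token, so disposal cancels the pending dismiss.
async Task DismissAfterAsync(TimeSpan delay, CancellationToken token)
{
    try
    {
        await Task.Delay(delay, token);
        dismissed = true;   // real code: remove the toast, InvokeAsync(StateHasChanged)
    }
    catch (OperationCanceledException)
    {
        // Disposed before the delay elapsed; nothing left to update.
    }
}

var pending = DismissAfterAsync(TimeSpan.FromSeconds(5), cts.Token);

// Simulates Dispose(): cancel outstanding delays, then release the source.
cts.Cancel();
await pending;
cts.Dispose();

Console.WriteLine($"dismissed = {dismissed}");
```

Because the cancellation is caught inside the async method, the task completes normally and no unobserved exception is left behind.
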
**Resolution**
_Unresolved._
### CentralUI-011 — `DiffDialog` leaves a dangling `TaskCompletionSource` when disposed while open
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Shared/DiffDialog.razor:89-95,151-157` |
**Description**
`OpenAsync` creates `_tcs` and returns `_tcs.Task` to the caller, which
typically `await`s it. The task is completed only by `Close()`. If the user
navigates away while the dialog is open, `DisposeAsync` runs but never completes
`_tcs`, so the awaiting caller's continuation never resumes — a permanently
suspended `Task` (and any `using`/cleanup after the await is skipped). The
`IDialogService.Confirm/Prompt` path has the same shape but at least its host
is a single long-lived `DialogHost`; `DiffDialog` is per-page.
**Recommendation**
In `DisposeAsync`, call `_tcs?.TrySetResult(false)` (or `TrySetCanceled`) so any
awaiter completes deterministically.
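
As a sketch of the suggested fix, reduced to plain top-level code so it runs without Blazor: `OpenAsync` and `DisposeAsync` below are stand-ins for the dialog's members, and `tcs` for its `_tcs` field.

```csharp
using System;
using System.Threading.Tasks;

TaskCompletionSource<bool>? tcs = null;   // stands in for the dialog's _tcs field

// Stand-in for DiffDialog.OpenAsync: hands the caller a task to await.
Task<bool> OpenAsync()
{
    tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
    return tcs.Task;
}

// Stand-in for DiffDialog.DisposeAsync: completes any pending awaiter.
ValueTask DisposeAsync()
{
    // TrySetResult is a no-op if Close() already completed the task.
    tcs?.TrySetResult(false);
    return ValueTask.CompletedTask;
}

var result = OpenAsync();               // dialog opened, caller is waiting
await DisposeAsync();                   // user navigates away
Console.WriteLine(await result);        // caller resumes with "not confirmed"
```

`TrySetCanceled()` is the alternative when callers should see an `OperationCanceledException` instead of a `false` result; which one fits depends on how call sites treat a dismissed dialog.
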
**Resolution**
_Unresolved._
### CentralUI-012 — N+1 query loading data connections for the Sites page
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Pages/Admin/Sites.razor:196-205` |
**Description**
`LoadDataAsync` fetches all sites, then issues
`SiteRepository.GetDataConnectionsBySiteIdAsync(site.Id)` once per site in a
loop. With N sites this is N+1 database round-trips on every page load and every
post-delete refresh. The connection lists are only used for a small per-card
summary.
**Recommendation**
Add a repository method that returns all data connections (or connections for a
set of site ids) in one query and group them client-side, or project the small
summary in a single query.
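
To illustrate the client-side grouping step, assuming a single query has already returned every connection: the tuples below are hypothetical rows, not the module's entity types.

```csharp
using System;
using System.Linq;

// Hypothetical rows standing in for the result of one "all data connections" query.
var connections = new[]
{
    (SiteId: 1, Name: "OPC-A"),
    (SiteId: 1, Name: "OPC-B"),
    (SiteId: 3, Name: "Modbus-1"),
};
var siteIds = new[] { 1, 2, 3 };

// One pass to index by site; a lookup yields an empty sequence for missing keys,
// so sites without connections need no special case.
var bySite = connections.ToLookup(c => c.SiteId);

foreach (var id in siteIds)
    Console.WriteLine($"site {id}: {bySite[id].Count()} connection(s)");
```

This replaces N per-site round-trips with one query plus an in-memory index, at the cost of loading connections for sites the page may not display.
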
**Resolution**
_Unresolved._
### CentralUI-013 — `ScriptAnalysisService` blocks on async shared-script lookups
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/ScriptAnalysis/ScriptAnalysisService.cs:951-952` |
**Description**
`ResolveCalledShape` calls `_sharedScripts.GetShapesAsync().GetAwaiter().GetResult()`
to resolve a shared-script shape synchronously. `GetShapesAsync` ultimately hits
`SharedScriptService` and its EF Core repository. Sync-over-async on a request
thread risks thread-pool starvation under load and can deadlock if any awaited
continuation needs a captured context. `Hover` and `SignatureHelp` (which call
`ResolveCalledShape`) are themselves synchronous methods, so the blocking call
is structural.
**Recommendation**
Make `Hover` and `SignatureHelp` async and `await` `GetShapesAsync`, or have the
catalog expose a cached synchronous snapshot that is refreshed asynchronously.
The `IMemoryCache` is already present — caching the shapes there and reading
them synchronously would remove the blocking call.
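
The snapshot variant might be sketched as follows. `LoadShapesAsync` is a hypothetical stand-in for the real async lookup, and the string-to-string map is an assumed simplification of whatever shape type the catalog actually returns:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Last successfully loaded shapes; starts empty until the first refresh completes.
IReadOnlyDictionary<string, string> snapshot = new Dictionary<string, string>();

// Hypothetical stand-in for the async shared-script lookup (GetShapesAsync).
async Task<IReadOnlyDictionary<string, string>> LoadShapesAsync()
{
    await Task.Yield();
    return new Dictionary<string, string> { ["Scale"] = "(double value, double factor)" };
}

// Runs off the request path (startup, a timer, or after a shared script is saved).
async Task RefreshAsync()
{
    var fresh = await LoadShapesAsync();
    Volatile.Write(ref snapshot, fresh);   // publish the new map in one reference swap
}

// Synchronous read for Hover / SignatureHelp; never blocks on I/O.
string? ResolveCalledShape(string name) =>
    Volatile.Read(ref snapshot).TryGetValue(name, out var shape) ? shape : null;

Console.WriteLine(ResolveCalledShape("Scale") ?? "(not loaded yet)");
await RefreshAsync();
Console.WriteLine(ResolveCalledShape("Scale") ?? "(not loaded yet)");
```

The trade-off is staleness: a lookup made before the first refresh, or just after a shared script changes, sees the previous snapshot, which is usually acceptable for editor hints.
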
**Resolution**
_Unresolved._
### CentralUI-014 — Test Run side effects (HTTP/SQL/SMTP) fire against production services
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/ScriptAnalysis/ScriptAnalysisService.cs:254-259`; `src/ScadaLink.CentralUI/ScriptAnalysis/SandboxHostHelpers.cs:26-117` |
**Description**
By design (documented in the XML comments) Test Run wires `ExternalSystem`,
`Database`, and `Notify` to central's *real* `IExternalSystemClient`,
`IDatabaseGateway`, and `INotificationDeliveryService`, so a Test Run that calls
`Notify.To(...).Send(...)` actually emails recipients, `Database.Connection(...)`
opens a real DB connection, and `External.Call(...)` makes real HTTP calls —
with production-equivalent side effects. There is no dry-run mode, no
confirmation, and (combined with CentralUI-001) no restriction on what a script
can do. A Design user testing a draft script can dispatch real notifications or
mutate external databases. The behaviour is intentional but the blast radius is
not surfaced to the user.
**Recommendation**
At minimum, surface a clear warning in the Test Run UI that side effects are
real, and require explicit opt-in for side-effecting calls. Preferably offer a
dry-run mode that stubs the helpers, defaulting to dry-run.
**Resolution**
_Unresolved._
### CentralUI-015 — `DialogService` continuations resolve off the render thread
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/ServiceCollectionExtensions.cs:24`; `src/ScadaLink.CentralUI/Components/Shared/DialogService.cs:18-69` |
**Description**
`DialogService` is `AddScoped` (one per circuit, correct) but
`ConfirmAsync`/`PromptAsync` complete via `ContinueWith(..., TaskScheduler.Default)`,
so a caller awaiting them resumes on a thread-pool thread. Any subsequent
component state mutation by the caller is then off the render thread unless the
caller wraps it in `InvokeAsync`. Call sites do not do so consistently,
which can produce non-deterministic render glitches.
**Recommendation**
Either resolve continuations on the circuit's sync context or document that
callers must `InvokeAsync` after awaiting `ConfirmAsync`/`PromptAsync`. Audit
call sites for off-thread state mutation.
**Resolution**
_Unresolved._
### CentralUI-016 — Pagers render one button per page with no windowing
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Shared/DataTable.razor:62-68`; `src/ScadaLink.CentralUI/Components/Pages/Deployment/Deployments.razor:167-173` |
**Description**
The `DataTable` and `Deployments` paginators loop `for i = 1..totalPages` and
emit a `<li>` button for every page. With a few thousand records at page size 25
that is hundreds of buttons rendered into the diff on every state change. It is
not a correctness bug but degrades render performance and usability on large
datasets.
**Recommendation**
Window the pager (first / prev / a few around current / next / last) or switch
large lists to a "load more" / numeric jump input.
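
A windowed pager can be computed with a small pure function. The sketch below is one way to do it; `PagerWindow`, its `radius` parameter, and the use of `0` as a gap marker are all illustrative choices, not existing code:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: returns the page numbers to render, with 0 marking a gap
// (rendered as an ellipsis) between the first/last page and the current window.
static List<int> PagerWindow(int current, int totalPages, int radius = 2)
{
    var pages = new List<int>();
    if (totalPages <= 0) return pages;

    int from = Math.Max(2, current - radius);
    int to = Math.Min(totalPages - 1, current + radius);

    pages.Add(1);                                   // first page is always shown
    if (from > 2) pages.Add(0);                     // gap before the window
    for (int p = from; p <= to; p++) pages.Add(p);  // pages around the current one
    if (to < totalPages - 1) pages.Add(0);          // gap after the window
    if (totalPages > 1) pages.Add(totalPages);      // last page is always shown
    return pages;
}

Console.WriteLine(string.Join(" ", PagerWindow(current: 50, totalPages: 200)));
```

With a radius of 2 the pager renders at most nine items regardless of how many pages exist.
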
**Resolution**
_Unresolved._
### CentralUI-017 — `/auth/logout` POST disables antiforgery, enabling logout CSRF
| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Auth/AuthEndpoints.cs:127-138` |
**Description**
The `POST /auth/logout` endpoint calls `.DisableAntiforgery()`, and a plain
`GET /logout` endpoint also signs the user out. Either can be triggered
cross-site (an `<img src="/logout">` or an auto-submitting form) to forcibly log
a user out. Login itself reasonably disables antiforgery (pre-auth), but logout
is a state-changing authenticated action and should be CSRF-protected.
**Recommendation**
Require an antiforgery token on `POST /auth/logout` (the `NavMenu` sign-out form
can include the antiforgery token), and remove or protect the state-changing
`GET /logout` route.
**Resolution**
_Unresolved._
### CentralUI-018 — Broad `catch {}` blocks swallow JS interop and storage errors silently
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.CentralUI/Components/Shared/MonacoEditor.razor:116-118,123,142,164,170,176,182,189`; `src/ScadaLink.CentralUI/Components/Shared/TreeView.razor:129,139`; `src/ScadaLink.CentralUI/Components/Pages/Admin/Sites.razor:316-319` |
**Description**
Numerous `try { ... } catch { }` blocks swallow every exception with no logging.
The prerender-time JS-unavailable case is legitimate, but these catches also
hide real failures: a genuine Monaco init failure or a clipboard permission
error becomes invisible. In `TreeView.razor` the storage-restore
`JsonSerializer.Deserialize` (line 139) is not inside a try at all and would
throw uncaught on a corrupt `treeviewStorage` payload. Debugging UI issues in
production is then guesswork.
**Recommendation**
Catch the specific expected exception type (e.g. `JSDisconnectedException`,
`InvalidOperationException` during prerender) and log anything else via
`ILogger`. Wrap the TreeView storage `Deserialize` in its own guarded block.
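
For the storage-restore case, a guarded block could look like the sketch below. It assumes the persisted payload is a JSON array of expanded node keys, which is a guess about the `treeviewStorage` format; `RestoreExpanded` is a hypothetical name, and `Console.Error` stands in for `ILogger`:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: restores persisted tree state, falling back to an empty
// set when the payload is missing or corrupt instead of throwing.
static HashSet<string> RestoreExpanded(string? payload)
{
    if (string.IsNullOrWhiteSpace(payload)) return new HashSet<string>();
    try
    {
        // Deserialize returns null for the JSON literal "null".
        return JsonSerializer.Deserialize<HashSet<string>>(payload) ?? new HashSet<string>();
    }
    catch (JsonException ex)
    {
        // Only the expected failure is caught, and it is recorded rather than hidden.
        Console.Error.WriteLine($"Ignoring corrupt tree state: {ex.Message}");
        return new HashSet<string>();
    }
}

Console.WriteLine(RestoreExpanded("[\"site-1\",\"site-2\"]").Count);
Console.WriteLine(RestoreExpanded("{not json").Count);
```

Catching `JsonException` specifically keeps unrelated failures visible while still tolerating a stale or hand-edited storage entry.
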
**Resolution**
_Unresolved._
### CentralUI-019 — Sparse unit-test coverage for a large module; critical paths untested
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.CentralUI.Tests/` |
**Description**
The module has ~65 source files but unit tests cover only the script analyzer,
TreeView, schema model, and two data-connection pages. Untested critical paths
include: the auth bridge (`CookieAuthenticationStateProvider`,
`AuthEndpoints`), `RunInSandboxAsync` (timeout, recursion limit, error
classification, side-effect wiring), `DialogService` resolution semantics,
`DebugView` stream lifecycle and the `UpsertWithCap` cap logic, `Health` and
`Deployments` timer behaviour, and `SchemaBuilderModel` round-tripping of nested
schemas. Given findings CentralUI-001/003/009/010 sit on untested code, the gap
is material. The Playwright suite covers login and navigation only.
**Recommendation**
Add bUnit/unit tests for the auth bridge, sandbox-run behaviour (including
forbidden-API rejection once CentralUI-001 is fixed), dialog resolution, and the
DebugView cap/lifecycle logic. Prioritise the paths named in the Critical/High
findings.
**Resolution**
_Unresolved._

@@ -0,0 +1,313 @@
# Code Review — ClusterInfrastructure
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.ClusterInfrastructure` |
| Design doc | `docs/requirements/Component-ClusterInfrastructure.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 8 |
## Summary
The ClusterInfrastructure module is currently a **Phase 0 skeleton**. It contains
only two source files: `ClusterOptions.cs`, a plain options POCO, and
`ServiceCollectionExtensions.cs`, whose two registration methods are explicit no-ops.
None of the responsibilities described in `Component-ClusterInfrastructure.md`
(Akka.NET cluster bootstrap, leader election, failover detection, split-brain
resolution, cluster singleton hosting, Windows service lifecycle) are implemented.
There are therefore no correctness, concurrency, or Akka-convention defects to find
in *behaviour*, because there is no behaviour. The findings below instead concern
(a) the large gap between the design doc and the code, (b) the options class missing
the validation, configuration-binding affordances, and coverage of documented
settings that peer modules provide, and (c) the no-op DI extensions silently
returning success, which is a latent reliability hazard once the Host wires this
module in. The dominant theme is **incompleteness**: this module is the foundation
every other component runs on, yet it presently delivers nothing the design requires.
The single options class is clean and its test covers defaults and setters
adequately for what exists.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | No executable logic exists beyond an options POCO; no logic bugs, but `ServiceCollectionExtensions` returns success while doing nothing (CI-002). |
| 2 | Akka.NET conventions | ✓ | No actors, no `ActorSystem` bootstrap, no supervision, no cluster/singleton wiring exist despite the design doc requiring all of them (CI-001). Nothing to assess against `Tell`/`Ask`, immutability, or `PipeTo`. |
| 3 | Concurrency & thread safety | ✓ | No shared mutable state, no actors, no async code. No issues found in current code. |
| 4 | Error handling & resilience | ✓ | Failover, split-brain, dual-node recovery, and graceful-shutdown logic are entirely absent (CI-001). No exception paths to review in current code. |
| 5 | Security | ✓ | No authn/authz surface in this module. Akka remoting is unconfigured, so transport security cannot be assessed; flagged as part of the missing implementation (CI-001). No secret handling present. |
| 6 | Performance & resource management | ✓ | No streams, connections, timers, or `IDisposable` resources exist yet. No issues found in current code. |
| 7 | Design-document adherence | ✓ | Severe drift: the module implements none of its documented responsibilities (CI-001). `ClusterOptions` also omits remoting host/port, cluster role/site identifier, gRPC port, storage paths, and `down-if-alone` (CI-003). |
| 8 | Code organization & conventions | ✓ | Options class is correctly owned by the component project. Missing config-section-name constant (CI-005) and missing `IValidateOptions`/data-annotation validation (CI-004) versus the Options pattern intent. |
| 9 | Testing coverage | ✓ | `ClusterOptionsTests` covers defaults and setters. No tests for any cluster behaviour because none exists; the test project references nothing else (CI-006). |
| 10 | Documentation & comments | ✓ | `ClusterOptions` has no XML doc comments unlike peer options classes (CI-007). The "Phase 0 skeleton" placeholders are undocumented at the module level — no README or tracking note (CI-008). |
## Findings
### ClusterInfrastructure-001 — Module implements none of its documented responsibilities
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:9`, `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:16` |
**Description**
`Component-ClusterInfrastructure.md` assigns this module seven concrete
responsibilities: bootstrap the Akka.NET `ActorSystem`, form the two-node cluster,
manage leader election / active-standby role assignment, detect node failures and
trigger failover, provide remoting, host the cluster singleton, and manage the
Windows service lifecycle. The entire module is two files: a `ClusterOptions` POCO
and a `ServiceCollectionExtensions` whose methods are explicitly commented
`// Phase 0: skeleton only` and `// Phase 0: placeholder for Akka actor registration`
and simply return the unmodified `IServiceCollection`. There is no `Akka.Cluster`,
`Akka.Cluster.Tools`, `Akka.Remote`, or split-brain-resolver dependency in the
`.csproj` at all (it references only `Microsoft.Extensions.DependencyInjection.Abstractions`,
`Microsoft.Extensions.Options`, and `ScadaLink.Commons`). Because every other
ScadaLink component runs inside the actor system this module is responsible for
creating, the absence of any implementation blocks the foundational layer of the
system.
**Recommendation**
Track the gap explicitly (a milestone/issue) and implement the documented behaviour:
add the Akka cluster/remote/cluster-tools and split-brain-resolver package
references, build the cluster bootstrap (HOCON generation from `ClusterOptions`),
the split-brain resolver configuration, cluster-singleton hosting support, and
`CoordinatedShutdown` wiring. Until then, the module's `Status` and the design doc
should clearly state it is unimplemented so callers do not assume otherwise.
**Resolution**
_Unresolved._
### ClusterInfrastructure-002 — No-op DI extension methods report success while doing nothing
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:7-17` |
**Description**
`AddClusterInfrastructure` and `AddClusterInfrastructureActors` both accept an
`IServiceCollection` and return it unchanged. A caller (e.g. the Host) that invokes
`services.AddClusterInfrastructure()` receives a fluent, success-looking result but
no actor system, no cluster, and no singleton support is actually registered. This
is a silent failure: the system will appear to start, then fail later and far from
the cause (e.g. when a component resolves an `ActorSystem` that was never added, or
when the cluster singleton never forms). A no-op that masquerades as a completed
registration is worse than an unimplemented method that throws.
**Recommendation**
Until the real implementation exists, make the placeholder loud rather than silent —
either throw `NotImplementedException` from the methods, or have them log a
prominent warning, so an integrating caller fails fast with a clear cause. Replace
with the genuine registration when CI-001 is addressed.
**Resolution**
_Unresolved._
### ClusterInfrastructure-003 — ClusterOptions omits several documented node-configuration settings
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11` |
**Description**
The "Node Configuration", "Split-Brain Resolution", and "Failure Detection Timing"
sections of the design doc enumerate the settings each node needs. `ClusterOptions`
exposes `SeedNodes`, `SplitBrainResolverStrategy`, `StableAfter`,
`HeartbeatInterval`, `FailureDetectionThreshold`, and `MinNrOfMembers`, but is
missing: the Akka remoting hostname/port (default 8081 central, 8082 site), the
cluster role (Central vs. Site) and the site identifier, the `down-if-alone` flag
(the design explicitly requires `down-if-alone = on` for the keep-oldest resolver),
and — for site nodes — the gRPC port (default 8083) and local SQLite storage paths.
Without these, the options class cannot drive a correct HOCON configuration when
CI-001 is implemented. (Some settings such as remoting host/port may instead belong
in `Host/NodeOptions.cs`; the split of ownership should be decided deliberately, but
at minimum `down-if-alone` belongs with the split-brain settings here.)
**Recommendation**
Add the missing settings — at minimum a `DownIfAlone` boolean (default `true`) and
the cluster role / site identifier — or document explicitly which settings are
owned by `Host/NodeOptions.cs` instead, so the design doc and the options classes
agree on where each value lives.
**Resolution**
_Unresolved._
### ClusterInfrastructure-004 — ClusterOptions has no validation despite safety-critical values
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11` |
**Description**
`ClusterOptions` carries values whose misconfiguration has cluster-wide
consequences. The design doc is emphatic that `min-nr-of-members` must be `1` (a
value of `2` blocks the singleton and therefore all data collection indefinitely
after failover), that `SplitBrainResolverStrategy` must be `keep-oldest` for a
two-node cluster (quorum strategies cause total shutdown), and that the timing
values are interdependent (`HeartbeatInterval` must be well below
`FailureDetectionThreshold`). The class has no data annotations, no
`IValidateOptions<ClusterOptions>`, and no guard logic, so an `appsettings.json`
setting `MinNrOfMembers: 2` or `SplitBrainResolverStrategy: "keep-majority"` (the
exact value the test at `ClusterOptionsTests.cs:35` shows is settable) would be
accepted silently and produce the catastrophic outcomes the design doc warns
against.
**Recommendation**
Add validation — data annotations (`[Range]` for `MinNrOfMembers`, etc.) plus an
`IValidateOptions<ClusterOptions>` implementation that enforces
`MinNrOfMembers == 1`, restricts `SplitBrainResolverStrategy` to a known set,
requires `SeedNodes` non-empty, and asserts `HeartbeatInterval <
FailureDetectionThreshold` and positive `StableAfter`. Register it with
`ValidateOnStart()` so misconfiguration fails fast at boot.
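
The rules themselves can be expressed as a plain function, sketched here without the Options package so it runs stand-alone. The parameter types and the single-entry allow-list are assumptions about `ClusterOptions`, and `ValidateCluster` is a hypothetical name:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical rule set. Parameter names mirror the ClusterOptions properties;
// their types (int, string, TimeSpan) are assumptions about the real class.
static List<string> ValidateCluster(
    int minNrOfMembers,
    string splitBrainResolverStrategy,
    IReadOnlyCollection<string> seedNodes,
    TimeSpan heartbeatInterval,
    TimeSpan failureDetectionThreshold,
    TimeSpan stableAfter)
{
    // Assumed allow-list: the review names keep-oldest as the safe two-node strategy.
    string[] allowedStrategies = { "keep-oldest" };
    var failures = new List<string>();

    if (minNrOfMembers != 1)
        failures.Add("MinNrOfMembers must be 1.");
    if (!allowedStrategies.Contains(splitBrainResolverStrategy))
        failures.Add($"SplitBrainResolverStrategy '{splitBrainResolverStrategy}' is not allowed.");
    if (seedNodes.Count == 0)
        failures.Add("SeedNodes must not be empty.");
    if (heartbeatInterval >= failureDetectionThreshold)
        failures.Add("HeartbeatInterval must be less than FailureDetectionThreshold.");
    if (stableAfter <= TimeSpan.Zero)
        failures.Add("StableAfter must be positive.");

    return failures;
}

// The misconfiguration described above: MinNrOfMembers 2 with keep-majority.
var problems = ValidateCluster(
    2, "keep-majority", new[] { "akka.tcp://scadalink@node-a:8081" },
    TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(10), TimeSpan.FromSeconds(15));

foreach (var problem in problems) Console.WriteLine(problem);
```

In the module these checks would sit inside the `Validate` method of an `IValidateOptions<ClusterOptions>` implementation, with the collected messages returned as a failed `ValidateOptionsResult`.
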
**Resolution**
_Unresolved._
### ClusterInfrastructure-005 — No configuration section name constant for the Options pattern binding
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3` |
**Description**
CLAUDE.md specifies per-component configuration via `appsettings.json` sections
bound with the Options pattern. `ClusterOptions` provides no `public const string
SectionName` (or equivalent) for the binding site to reference, so whichever code
binds the section must hard-code the magic string, and there is no single source of
truth for the section name. Because `AddClusterInfrastructure` is itself a no-op
(CI-002), the options class is currently bound nowhere at all, making the missing
constant easy to overlook.
**Recommendation**
Add a `public const string SectionName = "Cluster";` (or the agreed name) to
`ClusterOptions` and have the eventual `AddClusterInfrastructure` bind
`configuration.GetSection(ClusterOptions.SectionName)` against it.
**Resolution**
_Unresolved._
### ClusterInfrastructure-006 — No tests for any cluster behaviour; only the options POCO is covered
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.ClusterInfrastructure.Tests/ClusterOptionsTests.cs:1-51` |
**Description**
The test project contains only `ClusterOptionsTests`, exercising default values and
property setters of `ClusterOptions`. There are no tests for cluster formation,
leader election, failover detection, split-brain resolution, singleton handover, or
the `ServiceCollectionExtensions` registration methods — none can exist because the
behaviour itself is absent (CI-001). This is recorded so the testing gap is tracked
alongside the implementation gap: the most safety-critical paths of the entire
system (failover, split-brain, dual-node recovery) are completely untested. The
test at lines 30-50 also asserts that `SplitBrainResolverStrategy` can be set to
`"keep-majority"`, implicitly endorsing a value the design doc forbids for a
two-node cluster — see CI-004.
**Recommendation**
When CI-001 is implemented, add multi-node `Akka.Cluster.TestKit` /
`MultiNodeTestKit` tests covering cluster formation, failover promotion,
split-brain downing, and singleton handover, plus unit tests for HOCON generation
from `ClusterOptions` and for the options validation from CI-004.
**Resolution**
_Unresolved._
### ClusterInfrastructure-007 — ClusterOptions lacks XML documentation comments
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:3-11` |
**Description**
`ClusterOptions` and each of its six properties have no XML doc comments. Peer
options classes such as `StoreAndForward/StoreAndForwardOptions.cs` document the
class and every property (including units and design-doc references). For a class
whose values carry the cluster-wide consequences described in the design doc
(notably `MinNrOfMembers` and `SplitBrainResolverStrategy`), the absence of inline
documentation is a maintainability and safety gap — a future editor has no in-code
warning that `MinNrOfMembers` must stay `1`.
**Recommendation**
Add `<summary>` comments to the class and each property, stating units and the
documented constraints (e.g. that `MinNrOfMembers` must be `1`, that
`HeartbeatInterval` must be well below `FailureDetectionThreshold`), referencing
the relevant design-doc sections as peer modules do.
**Resolution**
_Unresolved._
### ClusterInfrastructure-008 — "Phase 0 skeleton" status is undocumented at the module level
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:9`, `src/ScadaLink.ClusterInfrastructure/ServiceCollectionExtensions.cs:16` |
**Description**
The only indication that this foundational module is unimplemented is two inline
comments inside the registration method bodies (`// Phase 0: skeleton only` /
`// Phase 0: placeholder for Akka actor registration`). There is no module README,
no `<!-- TODO -->` in the design doc, and no tracking marker visible to anyone
reading the project structure or the component table. Given that the design doc
(`Component-ClusterInfrastructure.md`) describes a fully featured component with no
caveat, a reader will reasonably assume the module is built. The mismatch between a
complete-looking design doc and an empty implementation is itself a documentation
defect.
**Recommendation**
Add a short note to the design doc (or a module-level `README.md`) stating the
current implementation status and what "Phase 0" delivers, and reference a tracked
issue for the remaining work (CI-001). Keep the README component table accurate
about which components are skeletons versus implemented.
**Resolution**
_Unresolved._

@@ -0,0 +1,448 @@
# Code Review — Commons
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.Commons` |
| Design doc | `docs/requirements/Component-Commons.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 12 |
## Summary
Commons is in good overall health. It is a well-organized, dependency-light library:
the architectural-constraint tests enforce the no-Akka/no-EF/no-ASP.NET rule, the
POCO-entity and message-as-record conventions, and the UTC timestamp rule. The folder
and namespace hierarchy closely matches REQ-COM-5b. No Critical issues were found.
The findings cluster around three themes. First, a handful of files quietly stretch
the REQ-COM-6 "no business logic" boundary — `StaleTagMonitor`, `OpcUaEndpointConfigSerializer`,
`OpcUaEndpointConfigValidator`, `ScriptParameters`, `ValueFormatter`, `DynamicJsonElement`
and `ScriptArgs` all carry non-trivial behavior, and a couple have real correctness or
concurrency defects (the `StaleTagMonitor` stale-fire race, the `DynamicJsonElement`
`JsonDocument`-lifetime hazard, the silent conversion-failure swallowing in
`ScriptParameters.GetNullable`). Second, the `ManagementCommandRegistry` name mapping is
asymmetric and namespace-scoped in a way that does not match the broader set of
`*Command` records elsewhere in `Messages/`. Third, several behavior-bearing types
(`ValueFormatter`, `DynamicJsonElement`, `ScriptArgs`, `ManagementCommandRegistry`,
`Result<T>`, the OPC UA serializer round-trip) have no unit tests despite containing the
kind of edge-case logic that warrants them. Entity and message contracts otherwise look
clean and additive-evolution-friendly, with the exception of one `ValueTuple` use in a
wire command.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | `DynamicJsonElement.TryConvert` returns success for non-convertible types; `Result<T>` allows null error; legacy-config fallback loses data. |
| 2 | Akka.NET conventions | ✓ | Commons has no actors (correct). Message contracts are records and immutable. One wire message uses `ValueTuple` (Commons-008). Correlation IDs present on request/response messages. |
| 3 | Concurrency & thread safety | ✓ | `StaleTagMonitor` has a check-then-act race between the timer callback and `OnValueReceived` (Commons-001). |
| 4 | Error handling & resilience | ✓ | `ScriptParameters.GetNullable` silently swallows conversion failures (Commons-003); OPC UA legacy deserialize discards malformed input (Commons-005). |
| 5 | Security | ✓ | No auth logic here. `SmtpConfiguration.Credentials` / OPC UA passwords are plain-string fields (storage/encryption is a consumer concern) — noted, not a finding. No script-trust violations: Commons defines no forbidden-API surface. |
| 6 | Performance & resource management | ✓ | `StaleTagMonitor` disposes its `Timer` correctly. `DynamicJsonElement` references a `JsonElement` whose backing document lifetime is not owned (Commons-002). |
| 7 | Design-document adherence | ✓ | Several behavior-bearing helper/validator/serializer classes push against REQ-COM-6 "no business logic" (Commons-007). Folder layout matches REQ-COM-5b. |
| 8 | Code organization & conventions | ✓ | `ManagementCommandRegistry` naming is asymmetric/namespace-scoped (Commons-004). `DeployedConfigSnapshot`, `InstanceAlarmOverride`, `TemplateFolder`, `ISiteRepository`, several service interfaces and `Messages/Management` exist but are not listed in Component-Commons.md (Commons-009). |
| 9 | Testing coverage | ✓ | `ValueFormatter`, `DynamicJsonElement`, `ScriptArgs`, `ManagementCommandRegistry`, `Result<T>`, `ConfigurationDiff`, `AlarmContext`, and the OPC UA serializer round-trip have no tests (Commons-010). |
| 10 | Documentation & comments | ✓ | `OpcUaEndpointConfigSerializer.Deserialize` XML doc does not mention the silent data-loss path (Commons-005). `Component-Commons.md` is stale relative to the actual file set (Commons-009). `ValueFormatter` uses current-culture formatting without documenting it (Commons-012). |
## Findings
### Commons-001 — `StaleTagMonitor` stale-fire race between timer and `OnValueReceived`
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.Commons/Types/StaleTagMonitor.cs:42-46`, `:62-67` |
**Description**
`OnValueReceived` sets `_staleFired = false` then calls `_timer.Change(...)`, while the
timer callback `OnTimerElapsed` reads `_staleFired`, sets it to `true`, and invokes the
`Stale` event. `_staleFired` is `volatile`, which guarantees visibility but not
atomicity of the check-then-set. The two methods run on different threads (a value-
arrival thread and a `ThreadPool` timer thread). If the timer callback has already
passed the `if (_staleFired) return;` check when `OnValueReceived` runs, `Stale` fires
even though a fresh value just arrived — a spurious staleness signal. There is also a
window where `OnValueReceived` resets `_staleFired` and reschedules the timer while a
callback for the previous period is mid-flight, so `Stale` can fire once per period as
documented but at the wrong moment. For a heartbeat monitor feeding connection-health
decisions, a false stale signal can trigger an unnecessary reconnect.
**Recommendation**
Guard the state transition with a lock, or replace the `_staleFired` bool with an
`Interlocked.CompareExchange` on an `int` so only one of "fire" / "reset" wins. The
callback should atomically test-and-set; `OnValueReceived` should atomically reset and
only then reschedule the timer.
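
The `Interlocked` variant could look roughly like the following. This is a minimal sketch, not the module's code: the class `StaleGate` and its members are hypothetical stand-ins for the flag handling inside `StaleTagMonitor`.

```csharp
using System.Threading;

// Hypothetical stand-in for the stale-flag state in StaleTagMonitor.
public sealed class StaleGate
{
    private int _staleFired; // 0 = not fired, 1 = fired

    // Timer-callback path: returns true only for the single caller that wins
    // the 0 -> 1 transition, so Stale is raised at most once per period.
    public bool TryFire() =>
        Interlocked.CompareExchange(ref _staleFired, 1, 0) == 0;

    // Value-arrival path: atomically clears the flag. The caller should
    // reschedule the timer only after this returns.
    public void Reset() => Interlocked.Exchange(ref _staleFired, 0);
}
```

This makes each transition atomic. It does not stop a callback that has already won `TryFire` from raising `Stale` just as a value arrives; if that residual window matters, a lock around both paths is the simpler choice.
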
**Resolution**
_Unresolved._
### Commons-002 — `DynamicJsonElement` retains a `JsonElement` whose `JsonDocument` lifetime it does not own
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.Commons/Types/DynamicJsonElement.cs:10-17` |
**Description**
`DynamicJsonElement` stores a `JsonElement` and exposes it for deferred, dynamic access
from scripts. A `JsonElement` is only valid while the `JsonDocument` that produced it has
not been disposed; accessing a `JsonElement` after its document is disposed throws
`ObjectDisposedException`. Nothing in `DynamicJsonElement` keeps the document alive or
documents that the caller must. Because the wrapper is explicitly designed for
"convenient property access in scripts" — i.e. access at an arbitrary later time — a
caller that wraps an element from a `using var doc = JsonDocument.Parse(...)` block (the
exact pattern used in `OpcUaEndpointConfigSerializer`) will hand scripts a wrapper that
faults on first member access.
**Recommendation**
Either clone the element on construction with `JsonElement.Clone()` (which detaches it
from the document and makes it safe to retain), or hold a reference to the owning
`JsonDocument` and implement `IDisposable`. Document the lifetime contract on the type
regardless.
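
A small sketch of the `Clone()` option, using only `System.Text.Json`. The helper name `JsonCloneDemo.ParseDetached` is hypothetical and only illustrates the lifetime behavior; it is not part of `DynamicJsonElement`.

```csharp
using System.Text.Json;

public static class JsonCloneDemo
{
    // Returns an element that stays valid after the parsing document is disposed.
    public static JsonElement ParseDetached(string json)
    {
        using var doc = JsonDocument.Parse(json);
        // Clone() copies the data out of the document, so the returned
        // element no longer depends on doc's lifetime.
        return doc.RootElement.Clone();
    }
}
```

With this helper, `JsonCloneDemo.ParseDetached("{\"a\":1}").GetProperty("a").GetInt32()` returns `1`, whereas returning `doc.RootElement` directly would throw `ObjectDisposedException` on first access.
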
**Resolution**
_Unresolved._
### Commons-003 — `ScriptParameters.GetNullable` silently swallows conversion failures
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Commons/Types/ScriptParameters.cs:72-86` |
**Description**
`GetNullable<T>` catches `ScriptParameterException` from `ConvertScalar` and returns
`default!` (null) "on conversion failure for nullable". This conflates two distinct
cases: a parameter that is genuinely absent/null, and a parameter that is *present but
holds an unconvertible value* (e.g. `Get<int?>("count")` when `count` is the string
`"banana"`). The latter is almost always a script or caller bug, and silently mapping it
to `null` hides it — the script then proceeds with a null it interprets as "not
supplied". The non-nullable `Get<T>` and the array/list paths correctly throw with a
descriptive message for the same bad input, so the behavior is also inconsistent across
the API surface. The XML doc states "returns null if missing, null, or unconvertible",
so the behavior is intentional, but it remains a footgun.
**Recommendation**
Distinguish "absent/null" from "present but unconvertible": return null only for the
former and throw `ScriptParameterException` for the latter, mirroring the array/list
element handling. If the swallowing must stay for compatibility, at minimum surface it
(e.g. an out-of-band warning) rather than failing silently.
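
The distinction can be sketched as below. The helper `NullableParamDemo.GetNullableInt` and its dictionary-based input are assumptions made for illustration, and `FormatException` stands in for `ScriptParameterException`, whose constructor is not shown in this review.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Globalization;

public static class NullableParamDemo
{
    // Hypothetical helper: null for absent/null, throws for present-but-bad.
    public static int? GetNullableInt(IReadOnlyDictionary<string, object?> values, string name)
    {
        if (!values.TryGetValue(name, out var raw) || raw is null)
            return null; // genuinely absent or null

        if (raw is int number)
            return number;

        if (raw is string text &&
            int.TryParse(text, NumberStyles.Integer, CultureInfo.InvariantCulture, out var parsed))
            return parsed;

        // Present but unconvertible: surface the bug instead of returning null.
        throw new FormatException($"Parameter '{name}' is not convertible to int.");
    }
}
```
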
**Resolution**
_Unresolved._
### Commons-004 — `ManagementCommandRegistry` name mapping is asymmetric and namespace-scoped
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.Commons/Messages/Management/ManagementCommandRegistry.cs:14-35` |
**Description**
`BuildRegistry` registers only types in the exact `ScadaLink.Commons.Messages.Management`
namespace whose names end in `Command`. `GetCommandName(Type)`, however, strips a
`Command` suffix from *any* type passed to it. The two halves disagree:
- `GetCommandName` will happily compute a command name for `*Command` records that live
in other `Messages/` sub-namespaces (`DeployInstanceCommand` in `Messages.Deployment`,
`DisableInstanceCommand` in `Messages.Lifecycle`, `SetStaticAttributeCommand` in
`Messages.Instance`, `DeployArtifactsCommand` in `Messages.Artifacts`, etc.), yet
`Resolve` will return `null` for every one of those names because they were never
registered.
- Because of this gap the Management namespace carries deliberately renamed duplicates
(`MgmtDeployInstanceCommand`, `MgmtEnableInstanceCommand`, `MgmtDisableInstanceCommand`,
`MgmtDeleteInstanceCommand` in `InstanceCommands.cs`) whose `Mgmt` prefix exists only
to dodge a collision the registry's namespace filter already prevents — a confusing,
undocumented coupling.
A round-trip `Resolve(GetCommandName(t))` is therefore not guaranteed to return `t`,
which is the implicit contract of a name registry.
**Recommendation**
Make the two methods symmetric: either scan all of `Messages/` (and detect/throw on
duplicate stripped names, since `ToFrozenDictionary` will throw on a collision) or
restrict `GetCommandName` to types the registry actually contains. Document the chosen
scope, and reconsider whether the `Mgmt*` prefixed duplicates are still needed.
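
One way to make the two halves symmetric is to derive both lookups from the same set of types. The sketch below is hypothetical (`CommandRegistrySketch` is not the real `ManagementCommandRegistry`, and the explicit type list replaces whatever assembly scan the real code performs); it only shows that `Resolve(GetCommandName(t))` round-trips once both methods share one source of truth.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public sealed class CommandRegistrySketch
{
    private const string Suffix = "Command";
    private readonly Dictionary<string, Type> _byName = new(StringComparer.Ordinal);
    private readonly Dictionary<Type, string> _byType = new();

    public CommandRegistrySketch(IEnumerable<Type> commandTypes)
    {
        foreach (var type in commandTypes)
        {
            var name = StripSuffix(type.Name);
            _byName.Add(name, type); // throws ArgumentException on a duplicate stripped name
            _byType.Add(type, name);
        }
    }

    // Only types the registry actually contains have a command name.
    public string GetCommandName(Type type) =>
        _byType.TryGetValue(type, out var name)
            ? name
            : throw new ArgumentException($"{type.FullName} is not a registered command.", nameof(type));

    public Type? Resolve(string name) =>
        _byName.TryGetValue(name, out var type) ? type : null;

    private static string StripSuffix(string typeName) =>
        typeName.EndsWith(Suffix, StringComparison.Ordinal)
            ? typeName.Substring(0, typeName.Length - Suffix.Length)
            : typeName;
}
```
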
**Resolution**
_Unresolved._
### Commons-005 — `OpcUaEndpointConfigSerializer.Deserialize` discards malformed legacy input and over-reports `IsLegacy`
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Commons/Serialization/OpcUaEndpointConfigSerializer.cs:25-51` |
**Description**
When the typed-deserialize path fails or the JSON lacks `endpointUrl`, `Deserialize`
falls through to `LoadLegacy`. If `LoadLegacy` itself throws `JsonException` (genuinely
malformed JSON), the method returns `(new OpcUaEndpointConfig(), IsLegacy: true)` — a
default, empty config with the legacy flag set. The original stored string is silently
discarded, and the caller is told it is a recoverable "legacy" row when in fact the data
was unparseable. A form built on the documented `IsLegacy` contract ("prompt the user to
re-save") will present an empty config as if it were the user's saved configuration,
inviting them to overwrite real (if malformed) data with blanks. The XML doc only
describes the happy legacy path and does not mention this data-loss branch.
**Recommendation**
Distinguish "parsed as legacy" from "could not parse at all" — e.g. return a third state
or throw for genuinely malformed input so the caller can surface an error instead of an
empty form. Update the XML doc to describe the failure branch.
**Resolution**
_Unresolved._
### Commons-006 — `DynamicJsonElement.TryConvert` reports success for unconvertible target types
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Commons/Types/DynamicJsonElement.cs:47-51`, `:66-76` |
**Description**
`TryConvert` does `result = ConvertTo(binder.Type); return result != null || binder.Type == typeof(object);`.
`ConvertTo` returns `null` for any type/kind pair it does not handle (e.g. requesting
`int` from a JSON string, or `DateTime` from anything). For a non-`object` target this
yields `result == null` and `return false`, which is correct. But the `|| binder.Type == typeof(object)`
clause makes `(object)dynamicElement` succeed with a `null` result even when the wrapped
element is, say, a JSON object or a non-null string — the cast silently produces `null`
instead of the element or its value. Any script doing `object o = jsonThing;` gets `null`
for a present value. The conversion of a present, non-null JSON value should never yield
`null`.
**Recommendation**
For the `object` target, return the element itself (or `Wrap(_element)`) rather than
`null`. Only return `null` when the wrapped element is genuinely `JsonValueKind.Null`.
**Resolution**
_Unresolved._
### Commons-007 — Several Commons types carry non-trivial logic, stretching REQ-COM-6
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Commons/Types/ScriptParameters.cs`, `src/ScadaLink.Commons/Serialization/OpcUaEndpointConfigSerializer.cs`, `src/ScadaLink.Commons/Validators/OpcUaEndpointConfigValidator.cs`, `src/ScadaLink.Commons/Types/StaleTagMonitor.cs`, `src/ScadaLink.Commons/Types/ScriptArgs.cs` |
**Description**
REQ-COM-6 states Commons "must contain only data structures, interfaces, enums, and
constants" and "must not contain any business logic", with method bodies "limited to
trivial data-access logic". Several files exceed that: `ScriptParameters` performs typed
conversion with reflection and JSON-element unwrapping; `OpcUaEndpointConfigSerializer`
implements a multi-shape (typed + legacy flat-dict) serialization strategy;
`OpcUaEndpointConfigValidator` encodes OPC UA domain rules (e.g. `LifetimeCount` ≥ 3×
`KeepAliveCount`); `StaleTagMonitor` runs a `Timer` and raises events; `ScriptArgs`
reflects over arbitrary objects. The `ArchitecturalConstraintTests` "no service/actor"
heuristic only counts public methods (> 3) and so does not catch these. This is design
drift, not a defect — but it should be a deliberate decision: either move these helpers
into the components that own the behavior (Data Connection Layer, Site Runtime,
Template Engine) or amend Component-Commons.md to explicitly permit "pure stateless
helpers/validators".
**Recommendation**
Decide and document the policy. If these are intentionally allowed in Commons, add a
sentence to REQ-COM-6 carving out pure validators/serializers/parsers; otherwise relocate
them. Tighten the architectural test if the rule is meant to be enforced.
**Resolution**
_Unresolved._
### Commons-008 — `SetConnectionBindingsCommand` uses `ValueTuple` in a wire message contract
| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.Commons/Messages/Management/InstanceCommands.cs:10` |
**Description**
`SetConnectionBindingsCommand` declares
`IReadOnlyList<(string AttributeName, int DataConnectionId)> Bindings`. The tuple element
names are compile-time-only. `System.Text.Json` ignores fields by default, so a
`ValueTuple` serializes as an empty object unless `IncludeFields` is enabled, and even
then its elements appear as `Item1` / `Item2`. The message is also positional with no
room for additive evolution (you cannot add a third field without changing the tuple
type, which REQ-COM-5a forbids). Every other message in `Messages/` uses named records.
A management command travels over the ClusterClient boundary and is exactly the kind of
contract REQ-COM-5a's additive-only rule targets.
**Recommendation**
Replace the tuple with a small named record, e.g.
`record ConnectionBinding(string AttributeName, int DataConnectionId)`, and use
`IReadOnlyList<ConnectionBinding>`.
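
A sketch of the named-record shape and how it serializes with default `System.Text.Json` options. `ConnectionBinding` comes from the recommendation above; `SetConnectionBindingsSketch` and `BindingJsonDemo` are hypothetical names, and the real command may carry additional fields not shown here.

```csharp
using System.Collections.Generic;
using System.Text.Json;

public sealed record ConnectionBinding(string AttributeName, int DataConnectionId);

// Hypothetical, reduced version of the command: only the bindings list.
public sealed record SetConnectionBindingsSketch(IReadOnlyList<ConnectionBinding> Bindings);

public static class BindingJsonDemo
{
    // Serializes one binding with default options to show the wire shape.
    public static string Serialize() =>
        JsonSerializer.Serialize(new SetConnectionBindingsSketch(
            new[] { new ConnectionBinding("Temp", 7) }));
}
```

`BindingJsonDemo.Serialize()` yields `{"Bindings":[{"AttributeName":"Temp","DataConnectionId":7}]}`, so the element names survive on the wire and a further property can be added to `ConnectionBinding` later without changing the list's element type.
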
**Resolution**
_Unresolved._
### Commons-009 — `Component-Commons.md` is stale relative to the actual file set
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `docs/requirements/Component-Commons.md:61-198` |
**Description**
The design doc's entity list, repository list, and folder tree no longer match the code:
- Entities present but undocumented: `DeployedConfigSnapshot`, `InstanceAlarmOverride`,
`TemplateFolder`.
- Repository interface present but undocumented: `ISiteRepository` (the doc lists seven
repositories under REQ-COM-4; the code has eight).
- Service interfaces present but undocumented: `IDatabaseGateway`,
`IExternalSystemClient`, `IInstanceLocator`, `INotificationDeliveryService` — REQ-COM-4a
documents only `IAuditService`.
- Whole namespaces absent from the REQ-COM-5b folder tree: `Messages/Management`,
`Messages/DataConnection`, `Messages/Integration`, `Messages/Instance`,
`Messages/RemoteQuery`, plus `Types/DataConnections`, `Types/Scripts`, `Serialization/`,
and `Validators/`.
CLAUDE.md's editing rules require the design docs to stay in sync with the code; the doc
is now a partial map.
**Recommendation**
Refresh Component-Commons.md to enumerate the current entities, repository and service
interfaces, and the actual `Types/`, `Messages/`, `Serialization/`, and `Validators/`
folders.
**Resolution**
_Unresolved._
### Commons-010 — Behavior-bearing Commons types have no unit tests
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.Commons.Tests/` |
**Description**
`ScadaLink.Commons.Tests` covers `Result`, `RetryPolicy`, `ScriptParameters`,
`StaleTagMonitor`, the OPC UA validator, enums, message conventions, compatibility, and
entity conventions. It does not cover several types that contain exactly the kind of
edge-case logic that warrants tests:
- `ValueFormatter` — scalar vs collection vs null formatting.
- `DynamicJsonElement` — member/index access and conversions; the issues in Commons-002
and Commons-006 would have been caught by tests.
- `ScriptArgs.Normalize` — dictionary/anonymous-object/primitive-rejection paths.
- `ManagementCommandRegistry` — `Resolve` / `GetCommandName` round-trip (would have
surfaced Commons-004).
- `Result<T>` — `Match`, failure/success accessors, error-on-misuse.
- `OpcUaEndpointConfigSerializer` typed↔flat round-trip and legacy fallback.
- `ConfigurationDiff` / `AlarmContext` / `ScriptScope` — minor, but `HasChanges` /
`HasParent` logic is untested.
**Recommendation**
Add focused unit tests for the helper/utility types above, prioritizing
`DynamicJsonElement`, `ScriptArgs`, `ManagementCommandRegistry`, and the OPC UA serializer
round-trip.
**Resolution**
_Unresolved._
### Commons-011 — `Result<T>.Failure` accepts a null error string
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Commons/Types/Result.cs:15-20`, `:30-32`, `:36` |
**Description**
`Result<T>.Failure(string error)` and the private failure constructor do not validate
`error`. A caller passing `null` produces a failed `Result` whose `Error` getter returns
`null` via `_error!`, and whose `Match` calls `onFailure(_error!)` with `null`. `Result`
is the system-wide error-handling type ("consistent error handling across component
boundaries"); a failed result with no error message defeats its purpose and pushes a
`NullReferenceException` risk onto every consumer that logs or displays `Error`.
**Recommendation**
Throw `ArgumentNullException` (or `ArgumentException` for empty/whitespace) in
`Failure`/the failure constructor so a failed `Result` always carries a message.
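
The guard could be as small as the sketch below. `ResultSketch<T>` is a hypothetical, reduced stand-in for `Result<T>`; only the validation in `Failure` is the point, and the remaining members are assumed for the sake of a compilable example.

```csharp
#nullable enable
using System;

public sealed class ResultSketch<T>
{
    private readonly T? _value;
    private readonly string? _error;

    private ResultSketch(bool isSuccess, T? value, string? error)
    {
        IsSuccess = isSuccess;
        _value = value;
        _error = error;
    }

    public bool IsSuccess { get; }

    public T Value => IsSuccess
        ? _value!
        : throw new InvalidOperationException("A failed result has no value.");

    public string Error => IsSuccess
        ? throw new InvalidOperationException("A successful result has no error.")
        : _error!;

    public static ResultSketch<T> Success(T value) => new(true, value, null);

    // A failed result always carries a non-empty message.
    public static ResultSketch<T> Failure(string error)
    {
        if (error is null) throw new ArgumentNullException(nameof(error));
        if (string.IsNullOrWhiteSpace(error))
            throw new ArgumentException("Error message must not be empty.", nameof(error));
        return new(false, default, error);
    }
}
```
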
**Resolution**
_Unresolved._
### Commons-012 — `ValueFormatter` uses current-culture formatting without documenting it
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.Commons/Types/ValueFormatter.cs:20-27` |
**Description**
`FormatDisplayValue` formats `IFormattable` values (and collection elements) with the
parameterless `ToString()`, which uses the current thread culture. The XML doc calls this
"the value's natural string representation" without noting the culture dependency. The
same numeric or `DateTime` attribute value will render differently depending on the
server/UI locale — e.g. decimal separators, date order. CLAUDE.md mandates UTC for
timestamps and notes local-time conversion is "a UI display concern only"; if
`ValueFormatter` is used outside a UI rendering context (e.g. logging, event-log entries,
diff display) the culture-dependent output is inconsistent and a latent bug.
**Recommendation**
Decide whether `ValueFormatter` is a UI-only helper. If it can be used outside the UI,
format with `CultureInfo.InvariantCulture` (using the `IFormattable.ToString(null, IFormatProvider)`
overload). Either way, document the culture behavior on the method.
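
A minimal sketch of culture-invariant formatting, assuming the helper is meant to be usable outside the UI. `InvariantFormatDemo.Format` is a hypothetical name, and returning an empty string for `null` is an assumption; the real `FormatDisplayValue` may treat `null` and collections differently.

```csharp
#nullable enable
using System;
using System.Globalization;

public static class InvariantFormatDemo
{
    public static string Format(object? value) => value switch
    {
        null => string.Empty, // assumption: null renders as an empty string
        // IFormattable covers numerics, DateTime, and similar types.
        IFormattable formattable => formattable.ToString(null, CultureInfo.InvariantCulture),
        _ => value.ToString() ?? string.Empty,
    };
}
```

`InvariantFormatDemo.Format(1234.5)` returns `"1234.5"` regardless of the current thread culture, whereas the parameterless `ToString()` would produce `"1234,5"` under a culture such as `de-DE`.
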
**Resolution**
_Unresolved._

@@ -0,0 +1,404 @@
# Code Review — Communication
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.Communication` |
| Design doc | `docs/requirements/Component-Communication.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 11 |
## Summary
The Communication module is generally well-structured and matches the design doc's
two-transport model (ClusterClient for command/control, gRPC server-streaming for
real-time data). The actors keep mutable state on the actor thread, use `PipeTo` for
async work, and the gRPC server/client lifecycle is mostly disciplined. However the
review found one Critical issue (a `TimeoutException` from `DebugStreamService` leaves
an orphaned bridge actor and an active site-side subscription, leaking resources on
every snapshot timeout) and several High/Medium issues clustered around two themes:
**(a) gRPC subscription bookkeeping races** — `SiteStreamGrpcClient` overwrites and
removes subscription entries by correlation ID without disposal or ownership checks,
so reconnect cycles leak `CancellationTokenSource`es and can cancel the wrong stream;
and **(b) missing supervision strategy** on the coordinator actors, contrary to the
CLAUDE.md "Resume for coordinator actors" decision. Design-doc adherence is otherwise
good. Test coverage is broad for happy paths but has gaps around failover, cache
mutation races, and the snapshot-timeout cleanup path.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Snapshot-timeout orphan, reconnect not calling `CleanupGrpc`, subscription-map races. |
| 2 | Akka.NET conventions | ✓ | No supervision strategy on coordinators; `Sender` captured in async-launched closure path. |
| 3 | Concurrency & thread safety | ✓ | `SiteStreamGrpcClient._subscriptions` overwrite/remove race; `_siteClients` field is never reassigned but is not `readonly`. |
| 4 | Error handling & resilience | ✓ | gRPC reconnect leaks server-side relay; `LoadSiteAddressesFromDb` swallows DB failures silently. |
| 5 | Security | ✓ | No findings in module code. DebugStreamHub auth lives outside this module (Central UI). |
| 6 | Performance & resource management | ✓ | Orphaned subscriptions/CTS leaks; `SiteStreamGrpcClientFactory.Dispose` blocks on async. |
| 7 | Design-document adherence | ✓ | `GrpcMaxStreamLifetime` / keepalive options defined but never applied; hard-coded values used instead. |
| 8 | Code organization & conventions | ✓ | Options pattern correct; minor: public records declared in actor files. No structural issues. |
| 9 | Testing coverage | ✓ | No tests for snapshot-timeout cleanup, address-cache refresh races, or gRPC server reconnect-leak. |
| 10 | Documentation & comments | ✓ | XML comment on `DebugStreamBridgeActor` says "Persistent actor" — it is not an Akka.Persistence actor. |
## Findings
### Communication-001 — Snapshot timeout leaves orphaned bridge actor and site subscription
| | |
|--|--|
| Severity | Critical |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.Communication/DebugStreamService.cs:139`, `src/ScadaLink.Communication/DebugStreamService.cs:149` |
**Description**
When `StartStreamAsync` times out waiting for the initial snapshot it calls
`StopStream(sessionId)` and throws. `StopStream` only sends `StopDebugStream` to the
bridge actor **if the session is still in `_sessions`**; the bridge actor is added to
`_sessions` at line 124 and is only removed by `onTerminatedWrapper`. The serious case
is the race where `onTerminatedWrapper` fires first (e.g. a site disconnect arrives
during the wait): `snapshotTcs.TrySetException` completes the await with an
`InvalidOperationException` rather than an `OperationCanceledException`, which the
`catch (OperationCanceledException)` block does **not** catch, so the exception
propagates and `StopStream` is never reached. Separately, if the bridge actor is
orphaned (snapshot never arrives, site silent, no terminate), the only cleanup is the
5-minute `ReceiveTimeout` in the actor, so a site-side `StreamRelayActor` and gRPC
stream can stay alive for up to 5 minutes after the central caller has given up.
Combined with the 30s timeout, every transient snapshot delay leaks site resources for
minutes.
**Recommendation**
In `StartStreamAsync`, wrap the `await` so that *any* failure or cancellation
deterministically calls `StopStream(sessionId)` (e.g. `try/catch (Exception)` or a
`finally` that stops the session when the result was not returned). Ensure
`StopStream` is idempotent and always sends `StopDebugStream` even if the session was
already removed, so the bridge actor (and its site-side subscription) is torn down
promptly rather than waiting for the orphan `ReceiveTimeout`.
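
The "stop on any non-success" shape can be sketched with plain tasks. `StartWithCleanupDemo.AwaitOrStopAsync` and its parameters are hypothetical; in `DebugStreamService` the `stop` callback would correspond to `StopStream(sessionId)`, and the real method awaits its own `TaskCompletionSource` rather than an arbitrary task.

```csharp
using System;
using System.Threading.Tasks;

public static class StartWithCleanupDemo
{
    // Awaits the snapshot and guarantees cleanup unless a result is returned.
    public static async Task<T> AwaitOrStopAsync<T>(Task<T> snapshot, TimeSpan timeout, Action stop)
    {
        var completed = false;
        try
        {
            // Task.WaitAsync (.NET 6+) throws TimeoutException when the timeout elapses.
            var result = await snapshot.WaitAsync(timeout);
            completed = true;
            return result;
        }
        finally
        {
            // Runs for timeouts, cancellations, and faults alike.
            if (!completed) stop();
        }
    }
}
```

Because the cleanup lives in `finally`, it does not depend on which exception type the await surfaces, which is the gap described above.
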
**Resolution**
_Unresolved._
### Communication-002 — gRPC reconnect does not unsubscribe the previous stream, leaking site-side relay actors
| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:170`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:143` |
**Description**
On a gRPC stream error, `HandleGrpcError` increments the retry count, flips
`_useNodeA`, and schedules `OpenGrpcStream`. `OpenGrpcStream` cancels and disposes
`_grpcCts` and starts a fresh `SubscribeInstance` call — but it never calls
`client.Unsubscribe(_correlationId)` on the *old* node's client, and the site-side
`SiteStreamGrpcServer` keys active streams by `correlation_id` only. Because the new
subscription goes to the *other* node (`_useNodeA` flipped), the old node's
`SiteStreamGrpcServer` still has an active stream + `StreamRelayActor` +
`SiteStreamManager` subscription for that correlation ID. The old node only learns the
client is gone via TCP RST or keepalive — exactly the failure mode that triggered the
reconnect (network partition / silent node), so detection may take ~25s or never happen. Each
reconnect can therefore leave a zombie relay actor on the failed node. `CleanupGrpc`
(which *does* call `Unsubscribe`) is only invoked on terminal paths, not between
reconnect attempts.
**Recommendation**
Before reconnecting in `HandleGrpcError` / at the top of `OpenGrpcStream`, call
`Unsubscribe(_correlationId)` on the client for the *previous* endpoint (the one that
just failed) so the local CTS is cancelled and — where the channel is still alive —
the gRPC cancellation reaches the site and stops the relay actor.
**Resolution**
_Unresolved._
### Communication-003 — SiteStreamGrpcClient subscription map overwritten without disposal; reconnect can cancel the wrong stream
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:77`, `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:106` |
**Description**
`SubscribeAsync` does `_subscriptions[correlationId] = cts;` (line 77),
unconditionally overwriting any existing entry for that correlation ID without
cancelling or disposing the previous `CancellationTokenSource`. The `finally` block
then does `_subscriptions.TryRemove(correlationId, out _)` (line 106) which removes
the entry **by key only, regardless of which CTS is stored**. Because
`DebugStreamBridgeActor` reuses the same `_correlationId` across reconnect attempts
(and `SiteStreamGrpcClientFactory` returns the same `SiteStreamGrpcClient` for a site
even after a node flip), two `SubscribeAsync` calls can briefly share a correlation
ID. The first call's `finally` then removes the *second* call's CTS entry, so a later
`Unsubscribe(correlationId)` finds nothing and the live stream is never cancelled — an
orphan. Conversely the overwritten CTS is leaked (never disposed).
**Recommendation**
When inserting, cancel+dispose any prior CTS for that correlation ID. In the `finally`,
remove only if the stored CTS is the one this call created (use the
`TryRemove(KeyValuePair)` overload, mirroring what `SiteStreamGrpcServer` already does
with `StreamEntry`). Consider keying subscriptions by a per-call GUID rather than the
caller-supplied correlation ID.
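
A sketch of ownership-checked bookkeeping with `ConcurrentDictionary`. `SubscriptionMapSketch` is hypothetical and assumes the map is the only place a `CancellationTokenSource` gets disposed; `SiteStreamGrpcClient` itself is not shown in this review.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public sealed class SubscriptionMapSketch
{
    private readonly ConcurrentDictionary<string, CancellationTokenSource> _subscriptions = new();

    public int Count => _subscriptions.Count;

    // Inserts a new CTS; any prior CTS for the same ID is cancelled and disposed
    // by whichever caller wins the replacement.
    public CancellationTokenSource Register(string correlationId)
    {
        var cts = new CancellationTokenSource();
        while (true)
        {
            if (_subscriptions.TryAdd(correlationId, cts))
                return cts;

            if (_subscriptions.TryGetValue(correlationId, out var previous) &&
                _subscriptions.TryUpdate(correlationId, cts, previous))
            {
                previous.Cancel();
                previous.Dispose();
                return cts;
            }
        }
    }

    // Removes the entry only if it still holds the CTS this call created.
    public bool Release(string correlationId, CancellationTokenSource owned)
    {
        var entry = new KeyValuePair<string, CancellationTokenSource>(correlationId, owned);
        if (!_subscriptions.TryRemove(entry)) // .NET 5+ overload: matches key and value
            return false;
        owned.Dispose();
        return true;
    }
}
```

With this shape, a stale call's `finally` cannot evict a newer subscription: its `Release` returns `false` and leaves the live entry in place.
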
**Resolution**
_Unresolved._
### Communication-004 — Coordinator actors declare no SupervisorStrategy (design requires Resume)
| | |
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:42`, `src/ScadaLink.Communication/Actors/SiteCommunicationActor.cs:22` |
**Description**
CLAUDE.md ("Explicit supervision strategies: Resume for coordinator actors, Stop for
short-lived execution actors") requires coordinator actors to use an explicit `Resume`
supervision strategy. `CentralCommunicationActor` and `SiteCommunicationActor` are
long-lived coordinators (they own the per-site ClusterClient map, debug
subscriptions, in-progress deployments) but neither overrides `SupervisorStrategy`.
They fall back to the Akka default (`OneForOneStrategy` with `Restart`). A child fault
— e.g. a `ClusterClient` child of `CentralCommunicationActor` created by
`DefaultSiteClientFactory` — would `Restart` under the default strategy, and any
exception in the coordinator itself would restart it, wiping `_siteClients`,
`_debugSubscriptions`, and `_inProgressDeployments` silently. The design intent is
`Resume` so transient child faults do not discard coordinator state.
**Recommendation**
Override `SupervisorStrategy` on both actors to return an explicit
`OneForOneStrategy` with `Directive.Resume` (or the project's standard coordinator
strategy), matching the documented decision and other coordinator actors.
**Resolution**
_Unresolved._
### Communication-005 — gRPC keepalive and max-stream-lifetime options are defined but never applied
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClient.cs:25`, `src/ScadaLink.Communication/CommunicationOptions.cs:36` |
**Description**
`CommunicationOptions` exposes `GrpcKeepAlivePingDelay`, `GrpcKeepAlivePingTimeout`,
`GrpcMaxStreamLifetime`, and `GrpcMaxConcurrentStreams`, and the design doc's
"gRPC Connection Keepalive" section explicitly states these are configurable. However
`SiteStreamGrpcClient`'s constructor hard-codes `KeepAlivePingDelay =
TimeSpan.FromSeconds(15)` and `KeepAlivePingTimeout = TimeSpan.FromSeconds(10)`
instead of reading the options. `GrpcMaxStreamLifetime` (the documented "Session
timeout — 4 hours" third layer of dead-client detection) is not referenced anywhere:
`SiteStreamGrpcServer.SubscribeInstance` creates a linked CTS from the call
cancellation token only, with no `CancelAfter`. The 4-hour zombie-stream safety net
described in the design doc does not exist in code. `GrpcMaxConcurrentStreams` is also
not wired to the server (`SiteStreamGrpcServer` takes a `maxConcurrentStreams`
constructor parameter defaulting to 100, but nothing binds the option to it).
**Recommendation**
Flow `CommunicationOptions` into `SiteStreamGrpcClient` and `SiteStreamGrpcServer`
(via the factory / DI). Apply `GrpcKeepAlivePingDelay` / `GrpcKeepAlivePingTimeout` to
the `SocketsHttpHandler`, bind `GrpcMaxConcurrentStreams` to the server's limit, and
implement the `GrpcMaxStreamLifetime` session timeout with `CancelAfter` on the
server-side stream CTS — or, if the 4-hour cap is intentionally dropped, remove the
option and update the design doc.
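
The two BCL pieces involved can be sketched as follows. `StreamLifetimeDemo` and its method names are hypothetical, and how the option values reach these calls (factory, DI) is left open; only `CancelAfter` and the `SocketsHttpHandler` keepalive properties are shown.

```csharp
using System;
using System.Net.Http;
using System.Threading;

public static class StreamLifetimeDemo
{
    // Links the call's cancellation token with a maximum stream lifetime,
    // so the stream is cancelled by whichever happens first.
    public static CancellationTokenSource CreateStreamCts(
        CancellationToken callCancelled, TimeSpan maxLifetime)
    {
        var cts = CancellationTokenSource.CreateLinkedTokenSource(callCancelled);
        cts.CancelAfter(maxLifetime);
        return cts;
    }

    // Applies configured keepalive values instead of hard-coded ones.
    public static SocketsHttpHandler CreateHandler(TimeSpan pingDelay, TimeSpan pingTimeout) =>
        new SocketsHttpHandler
        {
            KeepAlivePingDelay = pingDelay,
            KeepAlivePingTimeout = pingTimeout,
        };
}
```
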
**Resolution**
_Unresolved._
### Communication-006 — Site address load failures are silently swallowed, leaving a stale cache
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:204` |
**Description**
`LoadSiteAddressesFromDb` runs the repository query inside `Task.Run(...).PipeTo(self)`.
If `GetAllSitesAsync` throws (database unavailable, transient connection error), the
faulted task is piped to `Self` as a `Status.Failure`. `CentralCommunicationActor` has
no `Receive<Status.Failure>` handler, so the failure becomes an unhandled message
(logged at debug, not surfaced) and the periodic refresh silently fails. If the
*first* startup load fails the actor runs with an empty `_siteClients` map — every
`SiteEnvelope` is dropped (line 187) and every Ask times out with no indication of the
root cause.
**Recommendation**
Add a `Receive<Status.Failure>` handler that logs the load failure at Warning/Error
level so operators can distinguish "site has no addresses configured" from "database
is down". Optionally surface a health metric for repeated load failures.
**Resolution**
_Unresolved._
### Communication-007 — `SiteStreamGrpcClientFactory.Dispose` blocks on async work (sync-over-async)
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcClientFactory.cs:53` |
**Description**
`Dispose()` calls `DisposeAsync().AsTask().GetAwaiter().GetResult()`. This is the
classic sync-over-async pattern: it blocks the calling thread until all per-site
`SiteStreamGrpcClient.DisposeAsync` calls complete. If `Dispose` is invoked from a
context with a single-threaded synchronization context or from DI container shutdown
on a constrained thread pool, this can deadlock or stall host shutdown. The class
already implements `IAsyncDisposable`.
**Recommendation**
Prefer registering and disposing the factory through `IAsyncDisposable` only (modern
.NET DI honours it for singletons). If a synchronous `Dispose` must remain, dispose
the underlying `GrpcChannel`s directly (synchronous) rather than blocking on the async
path, or document why blocking is safe here.
**Resolution**
_Unresolved._
### Communication-008 — Reconnect retry-count reset can mask a flapping stream indefinitely
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:71`, `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:174` |
**Description**
`_retryCount` is reset to 0 every time a single `AttributeValueChanged` or
`AlarmStateChanged` event is received (lines 72, 77). Combined with `MaxRetries = 3`,
a stream that connects, delivers exactly one event, then fails — repeatedly — will
reconnect forever. The design doc states "max 3 retries, terminate the session if all
retries fail"; the current logic only terminates after 3 *consecutive* failures with
zero intervening events, so a flapping site never trips the limit and the debug
session (and its site-side relay) lives on indefinitely. The `ReceiveTimeout` orphan
net is also reset by every received message, so it does not bound this case either.
**Recommendation**
Either reset `_retryCount` only after the stream has been stably connected for some
minimum duration (e.g. a timer armed on stream open, cancelled on the next error), or
keep a separate cumulative reconnect counter / time window that bounds total
reconnects regardless of intervening events.
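The second option (a cumulative reconnect bound over a time window) could look roughly like the sketch below. `ReconnectBudget` is a hypothetical helper, and the limit and window values are illustrative rather than taken from the design doc.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: bounds reconnects inside a sliding time window,
// regardless of whether events arrived between failures.
public sealed class ReconnectBudget
{
    private readonly int _maxReconnects;
    private readonly TimeSpan _window;
    private readonly Queue<DateTimeOffset> _attempts = new();

    public ReconnectBudget(int maxReconnects, TimeSpan window)
    {
        _maxReconnects = maxReconnects;
        _window = window;
    }

    // Records a reconnect attempt at 'now'. Returns false once the budget
    // for the current window is spent, meaning the session should terminate.
    public bool TryReconnect(DateTimeOffset now)
    {
        // Forget attempts that have aged out of the window.
        while (_attempts.Count > 0 && now - _attempts.Peek() > _window)
            _attempts.Dequeue();

        if (_attempts.Count >= _maxReconnects)
            return false;

        _attempts.Enqueue(now);
        return true;
    }
}
```

The actor would call `TryReconnect` on every stream failure and stop itself when it returns `false`; received events no longer influence the decision, so a flapping stream trips the limit.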
**Resolution**
_Unresolved._
### Communication-009 — `_siteClients` field is mutable and reassignable; cache update is not atomic on failure
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:53`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:240` |
**Description**
`_siteClients` is a non-`readonly` `Dictionary` field. It is only mutated on the actor
thread (correct), but the field is needlessly reassignable, and
`HandleSiteAddressCacheLoaded` mutates it in place across several loops. If
`ActorPath.Parse` throws on a malformed address mid-loop (e.g. a site row with a
garbage `NodeAAddress`), the method aborts partway through, having already stopped
some ClusterClients and added others — leaving the cache partially updated with no
recovery until the next 60 s refresh. The actor's other mutable collections
(`_debugSubscriptions`, `_inProgressDeployments`) are correctly `readonly`.
**Recommendation**
Mark `_siteClients` `readonly`. Validate/parse all addresses up front (or wrap
`ActorPath.Parse` in a try/catch that logs and skips the bad site) so a single
malformed site record cannot abort the whole refresh and leave a half-updated cache.
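One way to validate up front is to parse every address into a fresh map first and only then reconcile the live cache against it. The sketch below is an assumption-laden illustration: it uses `Uri.TryCreate` as a stand-in for actor-path parsing, and `SiteAddressParser` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

public static class SiteAddressParser
{
    // Parses every address before any cache mutation happens. Malformed
    // entries are collected for logging and skipped instead of throwing.
    // Uri.TryCreate stands in for the real actor-path parser.
    public static (Dictionary<string, Uri> Parsed, List<string> Rejected) ParseAll(
        IReadOnlyDictionary<string, string> rawAddressesBySite)
    {
        var parsed = new Dictionary<string, Uri>();
        var rejected = new List<string>();

        foreach (var (siteId, raw) in rawAddressesBySite)
        {
            if (Uri.TryCreate(raw, UriKind.Absolute, out var uri))
                parsed[siteId] = uri;
            else
                rejected.Add(siteId);
        }

        return (parsed, rejected);
    }
}
```

Because nothing is stopped or added until `ParseAll` has returned, a single bad row can at worst leave one site unrefreshed rather than leaving the whole cache half-updated.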
**Resolution**
_Unresolved._
### Communication-010 — `DebugStreamBridgeActor` XML doc incorrectly describes it as a "Persistent actor"
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.Communication/Actors/DebugStreamBridgeActor.cs:10` |
**Description**
The class summary opens with "Persistent actor (one per active debug session)...".
The actor derives from `ReceiveActor`, not a persistent actor base class, holds no
`PersistenceId`, and writes no journal/snapshot. "Persistent" is misleading — debug
sessions are explicitly "session-based and temporary" per the design doc. A reader
could assume state survives restart, which it does not.
**Recommendation**
Reword the summary to "Long-lived (per active debug session) actor on the central
side..." or similar, removing the word "Persistent".
**Resolution**
_Unresolved._
### Communication-011 — No test coverage for snapshot-timeout cleanup, address-cache failure, or gRPC reconnect leak
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.Communication.Tests/` (module-wide) |
**Description**
The test suite covers happy-path routing, handler-not-registered failures, heartbeat
bumping, cache refresh, and gRPC bridge reconnect/retry. However several critical
paths identified in this review have no coverage:
- The `DebugStreamService.StartStreamAsync` snapshot-timeout path (Communication-001)
— no test verifies bridge actor / site subscription teardown on timeout, nor the
`onTerminated`-before-snapshot race that throws a non-`OperationCanceledException`.
- `CentralCommunicationActor` behaviour when `LoadSiteAddressesFromDb` faults
(Communication-006) — `RefreshSiteAddresses_UpdatesCache` only exercises success.
- `SiteStreamGrpcClient` subscription-map overwrite/removal race (Communication-003)
and gRPC reconnect not unsubscribing the old node (Communication-002).
- A malformed `NodeAAddress` aborting `HandleSiteAddressCacheLoaded` (Communication-009).
**Recommendation**
Add tests for: snapshot timeout / pre-snapshot termination cleanup; address-load
failure logging and empty-cache behaviour; reusing a correlation ID across
`SubscribeAsync` calls; and a malformed site address during cache refresh.
**Resolution**
_Unresolved._

@@ -0,0 +1,394 @@
# Code Review — ConfigurationDatabase
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.ConfigurationDatabase` |
| Design doc | `docs/requirements/Component-ConfigurationDatabase.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 11 |
## Summary
The ConfigurationDatabase module is a focused, conventional EF Core data-access layer:
a single `ScadaLinkDbContext`, Fluent API entity configurations, eight repository
implementations of Commons-defined interfaces, an `IAuditService` implementation, an
`IInstanceLocator`, environment-aware migration handling, and design-time tooling
support. Overall structure adheres well to the design doc and the CLAUDE.md "Code
Organization" decisions — POCO entities and interfaces live in Commons, EF mappings and
implementations live here, Fluent API only, and optimistic concurrency is correctly
applied to `DeploymentRecord` via `rowversion`. The module is generally healthy.
The main themes across findings are: (1) a genuine logic bug in
`GetTemplateWithChildrenAsync`, which loads child templates and then discards them, so
the method does not deliver what its name implies; (2) secret-bearing columns (SMTP
credentials, external-system auth config, database connection strings) persisted in
plaintext with no encryption-at-rest; (3) a hardcoded SQL `sa` connection string with a
password literal embedded in `DesignTimeDbContextFactory`; (4) the no-arg
`AddConfigurationDatabase()` overload, which silently registers nothing, making a
misconfigured central node fail late and opaquely; and (5) audit-trail robustness gaps —
`AuditService` can throw on serializing entities with navigation cycles, rolling back
the whole business operation, and the design doc's claim that audit `Id` is `Long/GUID`
disagrees with the `int` entity. Test coverage is good for the repositories that have
tests (Security, CentralUI, audit, concurrency, seed data, data protection) but several
repositories (`TemplateEngineRepository`, `DeploymentManagerRepository`,
`ExternalSystemRepository`, `InboundApiRepository`, `NotificationRepository`,
`SiteRepository`, `InstanceLocator`) have little or no direct coverage.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | `GetTemplateWithChildrenAsync` discards loaded children (CD-001); `GetApprovedKeysForMethodAsync` CSV parsing is brittle (CD-008). |
| 2 | Akka.NET conventions | ✓ | No actors in this module; data-access layer only. No issues found. |
| 3 | Concurrency & thread safety | ✓ | DbContext correctly scoped; optimistic concurrency on `DeploymentRecord` correct. Repositories hold no shared mutable state. No issues found. |
| 4 | Error handling & resilience | ✓ | `WaitForDatabaseReadyAsync` is sound. No-arg DI overload fails late and silently (CD-003); audit JSON serialization failure handling (CD-007). |
| 5 | Security | ✓ | Hardcoded `sa` credential literal (CD-002); SMTP/DB-connection/auth secrets stored unencrypted (CD-004). |
| 6 | Performance & resource management | ✓ | `GetAllTemplatesAsync` / `GetTemplateTreeAsync` eager-load multiple collections without `AsSplitQuery` (CD-009). No N+1 in audited paths. |
| 7 | Design-document adherence | ✓ | Audit `Id` type mismatch vs design doc (CD-005); seed data uses `HasData` consistent with design. |
| 8 | Code organization & conventions | ✓ | Mostly clean. `Grpc*` address columns unbounded (CD-006); inconsistent null-guard on injected context (CD-011). |
| 9 | Testing coverage | ✓ | Several repositories and `InstanceLocator` lack direct tests (CD-010). |
| 10 | Documentation & comments | ✓ | `DeploymentManagerRepository` "WP-24 stub" XML comment is stale; noted in module context but not raised as a standalone finding. No issues found beyond items above. |
## Findings
### ConfigurationDatabase-001 — `GetTemplateWithChildrenAsync` loads child templates then discards them
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/TemplateEngineRepository.cs:30-41` |
**Description**
`GetTemplateWithChildrenAsync` queries for all templates whose `ParentTemplateId`
equals the requested id, assigns the result to the local variable `children`, and
then returns `template` — the `children` list is never used, attached to the returned
object, or otherwise exposed. The method is therefore behaviourally identical to
`GetTemplateByIdAsync` but issues an extra database round-trip. Any caller relying on
the method name to obtain a template with its derived/child templates populated will
silently receive a template with no children, leading to incorrect template-resolution
or UI behaviour with no error.
**Recommendation**
Either populate the children onto the returned aggregate (e.g. project into a result
type that carries the children, or load them into a navigation collection that is
actually returned), or remove the dead query and the misleading method if children are
not in fact needed. If the navigation does not exist on the `Template` entity, add an
explicit result tuple/DTO so the loaded data reaches the caller.
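As an illustration of the explicit-result option, the sketch below returns the children alongside the template. `TemplateRow` is a hypothetical reduced shape (the real entity lives in Commons), and an in-memory sequence stands in for the EF Core query.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal shape, for illustration only.
public sealed record TemplateRow(int Id, int? ParentTemplateId, string Name);

// Explicit result type so the loaded children actually reach the caller.
public sealed record TemplateWithChildren(
    TemplateRow Template,
    IReadOnlyList<TemplateRow> Children);

public static class TemplateQueries
{
    // Returns null when the template does not exist; otherwise returns the
    // template together with every row whose ParentTemplateId matches.
    public static TemplateWithChildren? GetTemplateWithChildren(
        IEnumerable<TemplateRow> templates, int id)
    {
        var all = templates.ToList();
        var template = all.FirstOrDefault(t => t.Id == id);
        if (template is null) return null;

        var children = all.Where(t => t.ParentTemplateId == id).ToList();
        return new TemplateWithChildren(template, children);
    }
}
```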
**Resolution**
_Unresolved._
### ConfigurationDatabase-002 — Hardcoded `sa` connection string with embedded password literal
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/DesignTimeDbContextFactory.cs:21-22` |
**Description**
`DesignTimeDbContextFactory` falls back to a literal connection string
`"Server=localhost,1433;Database=ScadaLink_Config;User Id=sa;Password=YourPassword;TrustServerCertificate=True"`
when no configured connection string is found. Embedding a credential literal (even a
placeholder) in source code is a poor pattern: it is committed to version control,
encourages copy-paste of `sa`/`TrustServerCertificate=True` into real environments, and
the fallback can mask a genuine misconfiguration during `dotnet ef` operations by
silently pointing tooling at an unintended database.
**Recommendation**
Remove the hardcoded fallback. If no connection string is resolved from configuration
or environment, throw a clear `InvalidOperationException` instructing the developer to
set `ScadaLink:Database:ConfigurationDb` (or an environment variable). At minimum, read
the design-time connection string from an environment variable rather than a literal,
and never use `sa`.
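A fail-fast resolver along these lines would replace the literal fallback. The environment variable name `SCADALINK_CONFIG_DB` is an assumption made for this sketch, and the lookup is injected as a delegate so the behaviour can be exercised without touching the real environment.

```csharp
#nullable enable
using System;

public static class DesignTimeConnectionString
{
    // Hypothetical variable name, chosen for illustration only.
    public const string EnvVar = "SCADALINK_CONFIG_DB";

    // Returns the configured connection string or throws a descriptive
    // error. There is deliberately no hardcoded fallback.
    public static string Resolve(Func<string, string?> getEnv)
    {
        var value = getEnv(EnvVar);
        if (string.IsNullOrWhiteSpace(value))
        {
            throw new InvalidOperationException(
                "No design-time connection string found. Set " +
                "ScadaLink:Database:ConfigurationDb or the " + EnvVar +
                " environment variable.");
        }
        return value;
    }
}
```

In the factory this would be called as `DesignTimeConnectionString.Resolve(Environment.GetEnvironmentVariable)`.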
**Resolution**
_Unresolved._
### ConfigurationDatabase-003 — No-arg `AddConfigurationDatabase()` silently registers nothing
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/ServiceCollectionExtensions.cs:44-49` |
**Description**
The parameterless `AddConfigurationDatabase()` overload is a deliberate no-op "retained
for backward compatibility during migration." If a central node is wired up with this
overload by mistake, no `ScadaLinkDbContext`, repositories, `IAuditService`, or
`IInstanceLocator` are registered. The failure does not surface at startup; it surfaces
much later as opaque DI resolution exceptions the first time any consumer requests a
repository — far from the actual misconfiguration. The XML comment also refers to
"Phase 0 stubs," which is stale relative to the current state of the module.
**Recommendation**
Either delete the no-op overload now that the connection-string overload exists, or
mark it `[Obsolete]` with an error-level message so misuse is a compile-time failure.
If a true "site node" no-op is genuinely required, give it an explicit, self-documenting
name (e.g. `AddConfigurationDatabaseNoOp()`), and remove the stale "Phase 0" wording.
**Resolution**
_Unresolved._
### ConfigurationDatabase-004 — Secret-bearing columns stored in plaintext with no protection
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Configurations/NotificationConfiguration.cs:56-57`, `src/ScadaLink.ConfigurationDatabase/Configurations/ExternalSystemConfiguration.cs:25-26,75-77` |
**Description**
`SmtpConfiguration.Credentials`, `ExternalSystemDefinition.AuthConfiguration`, and
`DatabaseConnectionDefinition.ConnectionString` all hold authentication secrets (SMTP
OAuth2 client secrets / passwords, external-system API keys or Basic Auth credentials,
and database passwords respectively). They are mapped as ordinary string columns and
persisted verbatim. Anyone with read access to the configuration database — including
audit-log JSON if these entities are serialized into `AfterStateJson` — obtains the
plaintext secrets. The design doc does not call out encryption-at-rest for these
fields, so the design is also silent on a real risk.
**Recommendation**
Apply encryption to these fields, e.g. an EF Core value converter backed by ASP.NET
Data Protection (the module already configures `IDataProtectionKeyContext`), or rely on
SQL Server Always Encrypted / column encryption. Separately, ensure `IAuditService`
callers never pass these secret-bearing entities (or that the serializer redacts the
fields) so secrets do not leak into `AuditLogEntry.AfterStateJson`. Update the design
doc to state the chosen at-rest protection.
**Resolution**
_Unresolved._
### ConfigurationDatabase-005 — Audit `Id` type disagrees with the design doc
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Configurations/AuditConfiguration.cs:11` (entity `src/ScadaLink.Commons/Entities/Audit/AuditLogEntry.cs`) |
**Description**
The design doc's Audit Entry Schema table specifies `Id` as `Long / GUID`, and notes
the audit table is append-only and retained indefinitely. The actual `AuditLogEntry`
entity uses an `int` identity key. For a never-purged, append-only table that
accumulates one row per save operation across the system lifetime, a 32-bit identity
risks overflow over a long deployment horizon, and the code drifts from the documented
schema.
**Recommendation**
Change `AuditLogEntry.Id` to `long` (and the corresponding migration column to
`bigint`) to match the design doc and remove the overflow risk, or — if `int` is
intentional — update the design doc's schema table to say `int` and justify it.
Resolve the discrepancy in one direction.
**Resolution**
_Unresolved._
### ConfigurationDatabase-006 — `Site.GrpcNodeAAddress` / `GrpcNodeBAddress` columns are unbounded
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Configurations/SiteConfiguration.cs:24-25` |
**Description**
`SiteConfiguration` explicitly sets `HasMaxLength(500)` for `NodeAAddress` and
`NodeBAddress`, but the entity also has `GrpcNodeAAddress` and `GrpcNodeBAddress`
(added per the gRPC streaming design decision) which are not configured at all. With no
length set, EF Core maps them to `nvarchar(max)`. This is inconsistent with the sibling
address columns, wastes the opportunity to constrain input, and `nvarchar(max)` columns
cannot be used as index key columns and have different storage/performance characteristics.
**Recommendation**
Add `builder.Property(s => s.GrpcNodeAAddress).HasMaxLength(500);` and the same for
`GrpcNodeBAddress`, matching the existing `NodeAAddress`/`NodeBAddress` mapping, and
generate a migration to alter the column types.
**Resolution**
_Unresolved._
### ConfigurationDatabase-007 — `AuditService` does not handle JSON-serialization failure of arbitrary `afterState`
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Services/AuditService.cs:28-30` |
**Description**
`LogAsync` serializes the caller-supplied `afterState` object with
`JsonSerializer.Serialize(afterState)` using default options. EF entity POCOs commonly
have navigation properties; serializing an entity that has loaded navigations (e.g. a
`Template` with `Attributes`/`Scripts`, or any entity with a cycle) will throw
`JsonException` for a reference cycle or produce a very large payload. Because audit
writes are designed to commit in the same transaction as the change, a serialization
exception thrown here will roll back the *entire* business operation — a template
update fails because its audit entry could not be serialized. This couples audit
robustness to the shape of every entity passed in.
**Recommendation**
Configure `JsonSerializerOptions` with `ReferenceHandler.IgnoreCycles` (or
`Preserve`) and a sensible `MaxDepth`, and consider serializing a projected
DTO/snapshot rather than the live tracked entity. Decide explicitly whether an audit
serialization failure should fail the operation or be logged and degraded gracefully,
and document that decision against the design doc's transactional-guarantee section.
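The serializer configuration could be sketched as follows. `Node` is a hypothetical entity with a parent/child cycle and `AuditJson` is an illustrative wrapper; only the `System.Text.Json` options are real API.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical entity whose navigations form a cycle (child -> parent -> child).
public sealed class Node
{
    public string Name { get; set; } = "";
    public Node? Parent { get; set; }
    public List<Node> Children { get; } = new();
}

public static class AuditJson
{
    private static readonly JsonSerializerOptions Options = new()
    {
        // A reference that would close a cycle is written as null instead of throwing.
        ReferenceHandler = ReferenceHandler.IgnoreCycles,
        // Illustrative bound on nesting depth.
        MaxDepth = 32
    };

    public static string Serialize(object state) =>
        JsonSerializer.Serialize(state, Options);
}
```

`IgnoreCycles` only removes the exception; it does not shrink a large object graph, so projecting to a snapshot DTO remains the more robust choice for audit payloads.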
**Resolution**
_Unresolved._
### ConfigurationDatabase-008 — `GetApprovedKeysForMethodAsync` CSV parsing silently drops malformed ids
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/InboundApiRepository.cs:46-58` |
**Description**
`ApiMethod.ApprovedApiKeyIds` is stored as a comma-separated string of integer ids.
`GetApprovedKeysForMethodAsync` splits it, maps each token with
`int.TryParse(...) ? id : -1`, then filters with `id > 0`. Any token that fails to
parse, or a legitimately negative/zero id, is silently discarded. If `ApprovedApiKeyIds`
becomes corrupt (e.g. a stray name instead of an id), the method quietly returns fewer
approved keys than expected, which for an API-key authorization path means a method may
unexpectedly reject a key that should be approved. Storing a relational many-to-many as
a CSV string in a column is itself fragile (no FK integrity, no cascade on key delete).
**Recommendation**
Short term: log a warning when a token fails to parse instead of silently dropping it,
so corruption is observable. Longer term: replace the CSV column with a proper join
table (`ApiMethodApprovedKey`) with foreign keys to `ApiMethod` and `ApiKey`, which
gives referential integrity and correct cascade behaviour when an API key is deleted.
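For the short-term fix, the parser can return the rejected tokens so the caller can log them. This is a sketch under the assumption that ids are positive integers; `ApprovedKeyIds` is a hypothetical helper name.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ApprovedKeyIds
{
    // Splits the CSV column and returns both the valid ids and the tokens
    // that could not be accepted, so corruption is observable to the caller.
    public static (List<int> Ids, List<string> Malformed) Parse(string? csv)
    {
        var ids = new List<int>();
        var malformed = new List<string>();
        if (string.IsNullOrWhiteSpace(csv)) return (ids, malformed);

        var tokens = csv.Split(',',
            StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);

        foreach (var token in tokens)
        {
            // NumberStyles.None accepts digits only (no sign, no whitespace).
            if (int.TryParse(token, NumberStyles.None, CultureInfo.InvariantCulture, out var id) && id > 0)
                ids.Add(id);
            else
                malformed.Add(token);
        }

        return (ids, malformed);
    }
}
```

The repository method would log a warning for each entry in `Malformed` and continue with `Ids`.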
**Resolution**
_Unresolved._
### ConfigurationDatabase-009 — Multi-collection eager loads issue cartesian-product queries
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/TemplateEngineRepository.cs:43-51,53-61`, `src/ScadaLink.ConfigurationDatabase/Repositories/CentralUiRepository.cs:45-55` |
**Description**
`GetAllTemplatesAsync`, `GetTemplatesComposingAsync`, and `GetTemplateTreeAsync` each
`Include` three-to-four sibling collections (`Attributes`, `Alarms`, `Scripts`,
`Compositions`) in a single query. EF Core's default single-query strategy produces a
cartesian-product join across those collections, so a template with N attributes, M
alarms, and K scripts yields N×M×K rows that EF must then de-duplicate. For templates
with many members this materially inflates the result set and query time.
`GetInstanceByIdAsync`/`GetAllInstancesAsync` have the same shape with three
collections.
**Recommendation**
Add `.AsSplitQuery()` to these multi-collection-include queries (or set
`UseQuerySplittingBehavior(QuerySplittingBehavior.SplitQuery)` globally in
`AddConfigurationDatabase`) so each collection is loaded with a separate query and the
cartesian explosion is avoided.
**Resolution**
_Unresolved._
### ConfigurationDatabase-010 — Several repositories and `InstanceLocator` lack direct test coverage
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/TemplateEngineRepository.cs`, `Repositories/DeploymentManagerRepository.cs`, `Repositories/ExternalSystemRepository.cs`, `Repositories/InboundApiRepository.cs`, `Repositories/NotificationRepository.cs`, `Repositories/SiteRepository.cs`, `Services/InstanceLocator.cs` |
**Description**
The test project covers `SecurityRepository`, `CentralUiRepository`, `AuditService`,
optimistic concurrency, seed data, and Data Protection persistence. There are no direct
tests for `TemplateEngineRepository` (the largest repository, and the one with the
CD-001 bug, which a test would have caught), `DeploymentManagerRepository` (including
its `Local`-then-stub delete fallback and the `DeleteInstanceAsync`
restrict-FK-cleanup logic), `ExternalSystemRepository`, `InboundApiRepository` (notably
`GetApprovedKeysForMethodAsync` CSV parsing — CD-008), `NotificationRepository`,
`SiteRepository` (including its stub-attach delete path), or `InstanceLocator`.
**Recommendation**
Add repository-level tests using the existing `SqliteTestHelper` pattern, covering at
minimum: CRUD round-trips, the stub-attach delete fallbacks in
`DeploymentManagerRepository`/`SiteRepository`, `DeleteInstanceAsync`'s explicit
deployment-record cleanup, `GetApprovedKeysForMethodAsync` with valid/malformed CSV,
and `InstanceLocator.GetSiteIdForInstanceAsync` for found/not-found cases.
**Resolution**
_Unresolved._
### ConfigurationDatabase-011 — Inconsistent constructor null-guarding across repositories/services
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/ExternalSystemRepository.cs:11-14`, `Repositories/InboundApiRepository.cs:11-14`, `Repositories/NotificationRepository.cs:11-14`, `Services/InstanceLocator.cs:13-16` |
**Description**
`SecurityRepository`, `CentralUiRepository`, `TemplateEngineRepository`,
`DeploymentManagerRepository`, `SiteRepository`, and `AuditService` all guard their
injected `ScadaLinkDbContext` with `?? throw new ArgumentNullException(...)`.
`ExternalSystemRepository`, `InboundApiRepository`, `NotificationRepository`, and
`InstanceLocator` assign the constructor argument directly with no guard. This is a
minor consistency/maintainability issue: although the DI container will not normally
supply null, the divergence makes the codebase look unfinished and means a future
hand-constructed instance fails with a less informative `NullReferenceException` later.
**Recommendation**
Apply the same `?? throw new ArgumentNullException(nameof(context))` guard in the four
inconsistent constructors so all data-access types behave uniformly.
**Resolution**
_Unresolved._

@@ -0,0 +1,471 @@
# Code Review — DataConnectionLayer
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.DataConnectionLayer` |
| Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 13 |
## Summary
The DataConnectionLayer is a reasonably well-structured module: the Become/Stash
lifecycle state machine, the captured-`Self` marshalling of background-thread
disconnect events, and the protocol-factory abstraction all follow the design doc
and Akka.NET conventions. However, the review found one **critical** actor-model
violation — `HandleSubscribe` spawns a `Task.Run` that mutates the actor's private
dictionaries and counters from a thread-pool thread, racing with the actor's own
message loop. Several **high**-severity issues cluster around concurrency and error
handling: the subscription-failure path leaves the connection with degraded subtrees
but no real recovery, the `DataConnectionManagerActor`'s `Restart` supervision drops
all subscription state on a connection-actor crash, and `RealOpcUaClient`'s monitored-
item callback dictionary is mutated without synchronization while OPC UA notification
threads read it. The remaining findings concern stale health counters after failover,
an unused `WriteTimeout` option (writes are unbounded despite the design promising a
30 s timeout), `ReadBatchAsync` aborting mid-batch, and documentation drift between
the design doc's failover state machine and the implemented unstable-disconnect
heuristic. Test coverage is adequate for the happy paths and failover but absent for
tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | `_resolvedTags` double-counting and stale counters after failover; `ReadBatchAsync` aborts mid-batch. |
| 2 | Akka.NET conventions | x | `Task.Run` mutating actor state (critical); `Restart` supervision loses state; closures capturing `_subscriptionsByInstance`. |
| 3 | Concurrency & thread safety | x | Actor state mutated off the actor thread; `RealOpcUaClient` callback dictionary unsynchronized. |
| 4 | Error handling & resilience | x | Subscription failures not surfaced; unbounded write with no timeout; reconnect after subscribe-time failure not handled. |
| 5 | Security | x | `AutoAcceptUntrustedCerts` defaults to `true`; OPC UA password handling acceptable. See finding 012. |
| 6 | Performance & resource management | x | `HandleUnsubscribe` O(n^2) over instances; initial-read loop serial per tag. |
| 7 | Design-document adherence | x | Failover heuristic (unstable-disconnect count) differs from documented state machine; `WriteTimeout` documented but unused. |
| 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. |
| 9 | Testing coverage | x | No tests for tag-resolution retry, disconnect/re-subscribe, bad-quality push, or `HandleSubscribe` concurrency. |
| 10 | Documentation & comments | x | XML comment on `RaiseDisconnected` claims thread safety it does not have; design doc round-robin description stale. |
## Findings
### DataConnectionLayer-001 — `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread
| | |
|--|--|
| Severity | Critical |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:473-538` |
**Description**
`HandleSubscribe` launches a `Task.Run(async () => ...)` that runs on a thread-pool
thread and directly mutates the actor's private mutable state: `instanceTags` (a
reference into `_subscriptionsByInstance`), `_subscriptionIds`, `_totalSubscribed`,
`_resolvedTags`, and `_unresolvedTags`. All of these are simultaneously read and
written by the actor's own message loop (`HandleTagValueReceived`, `HandleUnsubscribe`,
`ReSubscribeAll`, `HandleRetryTagResolution`, `ReplyWithHealthReport`). This is a
direct violation of the Akka.NET actor model, which guarantees single-threaded access
to actor state only when that state is touched from the actor's own message handlers. Two concurrent
subscribe requests, or a subscribe overlapping a `TagValueReceived` / `GetHealthReport`,
produce data races on `Dictionary`/`HashSet`/`int` state. `Dictionary` is not thread-safe,
and concurrent mutation can corrupt internal buckets, throw, or lose entries. It can
also lose counter increments and leave the health counters mutually inconsistent.
**Recommendation**
Do not mutate actor state from the background task. Perform only the `await
_adapter.SubscribeAsync(...)` / `ReadAsync(...)` I/O in the task, collect the results
into a local immutable result object, and `PipeTo(Self)` an internal message (e.g.
`SubscribeCompleted`) whose handler — running on the actor thread — applies all state
mutations and counter updates. The response to `Sender` should be sent from that
handler too.
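The recommended shape might be sketched as below, assuming the Akka.NET package is referenced. All message and class names are hypothetical, and the injected delegate stands in for `_adapter.SubscribeAsync`. Treating every exception as an unresolved tag is a simplification here; real code would classify failures.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Akka.Actor;

// Hypothetical messages, named for illustration only.
public sealed record Subscribe(string InstanceId, IReadOnlyList<string> TagPaths);
public sealed record SubscribeCompleted(
    string InstanceId,
    IReadOnlyDictionary<string, string> SubscriptionIdsByTag,
    IReadOnlyList<string> UnresolvedTags,
    IActorRef ReplyTo);
public sealed record SubscribeAck(string InstanceId, int Resolved, int Unresolved);

public sealed class ConnectionActorSketch : ReceiveActor
{
    private readonly Dictionary<string, string> _subscriptionIds = new();
    private readonly HashSet<string> _unresolvedTags = new();
    private readonly Func<string, Task<string>> _subscribeTag; // stand-in for the adapter call

    public ConnectionActorSketch(Func<string, Task<string>> subscribeTag)
    {
        _subscribeTag = subscribeTag;

        Receive<Subscribe>(msg =>
        {
            var replyTo = Sender; // capture before leaving the actor thread
            SubscribeAllAsync(msg, replyTo).PipeTo(Self);
        });

        // Runs on the actor thread: the only place state is mutated.
        Receive<SubscribeCompleted>(done =>
        {
            foreach (var (tag, id) in done.SubscriptionIdsByTag) _subscriptionIds[tag] = id;
            foreach (var tag in done.UnresolvedTags) _unresolvedTags.Add(tag);
            done.ReplyTo.Tell(new SubscribeAck(
                done.InstanceId, done.SubscriptionIdsByTag.Count, done.UnresolvedTags.Count));
        });
    }

    // I/O only: builds local collections and never touches mutable actor fields.
    private async Task<SubscribeCompleted> SubscribeAllAsync(Subscribe msg, IActorRef replyTo)
    {
        var ids = new Dictionary<string, string>();
        var unresolved = new List<string>();
        foreach (var tag in msg.TagPaths)
        {
            try { ids[tag] = await _subscribeTag(tag); }
            catch (Exception) { unresolved.Add(tag); } // simplification, see lead-in
        }
        return new SubscribeCompleted(msg.InstanceId, ids, unresolved, replyTo);
    }
}
```

Because the result travels back as a message, two overlapping subscribe requests are applied one after the other by the mailbox rather than racing on the dictionaries.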
**Resolution**
_Unresolved._
### DataConnectionLayer-002 — `Restart` supervision discards all subscription state on connection-actor crash
| | |
|--|--|
| Severity | High |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionManagerActor.cs:131-141` |
**Description**
`DataConnectionManagerActor.SupervisorStrategy` returns a `OneForOneStrategy` with
`Directive.Restart` for `DataConnectionActor` failures. On restart, Akka.NET creates a
fresh actor instance, so all in-memory fields — `_subscriptionsByInstance`,
`_subscriptionIds`, `_subscribers`, `_unresolvedTags`, the quality counters — are
silently discarded. The actor re-enters `Connecting` with zero subscriptions, and the
design doc's "transparent re-subscribe" guarantee (WP-10) is broken: Instance Actors
that had subscribed before the crash never get their tags re-subscribed and will sit
at uncertain/stale quality indefinitely with no error returned. There is no durable
subscription store from which a restarted actor could rebuild state.
**Recommendation**
Either (a) make the subscription registry durable/recoverable so a restarted actor
can rebuild it (persist to local SQLite as the design doc says connection definitions
are, and have `PreStart` reload subscriptions), or (b) treat a connection-actor crash
as a lifecycle event the `DataConnectionManagerActor` notices, so it can re-issue the
subscription registrations. At minimum document that subscribers must re-register
after a crash and surface the lost-state condition rather than failing silently.
**Resolution**
_Unresolved._
### DataConnectionLayer-003 — `RealOpcUaClient` callback/monitored-item dictionaries mutated without synchronization
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:16-17,130-131,153,163,173,183-184` |
**Description**
`_monitoredItems` and `_callbacks` are plain `Dictionary<,>` instances. They are
written from `CreateSubscriptionAsync` / `RemoveSubscriptionAsync` (invoked from the
`DataConnectionActor`'s `Task.Run` / `ContinueWith` continuations, i.e. thread-pool
threads) and from `DisconnectAsync` (`.Clear()`), while being read concurrently from
the OPC Foundation SDK's `MonitoredItem.Notification` event handler, which fires on
the SDK's internal publish threads (`_callbacks.TryGetValue(handle, ...)` at line
163). Concurrent reads during a `Dictionary` resize or `Clear()` are undefined
behaviour — they can throw `InvalidOperationException`, return wrong entries, or
corrupt the dictionary. The `DataConnectionActor`'s subscribe path already runs off
the actor thread (finding 001), so multiple subscribe calls can also race each other
here.
**Recommendation**
Use `ConcurrentDictionary<,>` for `_monitoredItems` and `_callbacks`, or guard all
access with a lock. Note that fixing finding 001 (serialising subscribe through the
actor thread) reduces but does not eliminate the race, because the SDK notification
threads still read `_callbacks` concurrently with `RemoveSubscriptionAsync` /
`DisconnectAsync`.
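A reduced sketch of the `ConcurrentDictionary` option follows. `CallbackRegistry` and its handle and value types are hypothetical simplifications of the client's callback map.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical reduction of the callback registry; types are illustrative.
public sealed class CallbackRegistry
{
    private readonly ConcurrentDictionary<uint, Action<object>> _callbacks = new();

    public void Register(uint handle, Action<object> callback) =>
        _callbacks[handle] = callback;

    public bool Remove(uint handle) => _callbacks.TryRemove(handle, out _);

    public void Clear() => _callbacks.Clear();

    // Safe to call from notification threads while other threads
    // register, remove, or clear entries.
    public bool Dispatch(uint handle, object value)
    {
        if (!_callbacks.TryGetValue(handle, out var callback)) return false;
        callback(value);
        return true;
    }
}
```

A callback fetched just before `Remove` or `Clear` can still run once afterwards, so handlers should tolerate a late notification.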
**Resolution**
_Unresolved._
### DataConnectionLayer-004 — Subscribe-time tag-resolution failure leaves the connection healthy but never recovers correctly
| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:495-503,529-537` |
**Description**
When `_adapter.SubscribeAsync` throws inside the `HandleSubscribe` background task,
the catch block adds the tag to `_unresolvedTags` and increments `_totalSubscribed`,
treating every subscribe exception as a tag-resolution failure. But `SubscribeAsync`
also throws `InvalidOperationException` from `EnsureConnected()` when the OPC UA
client is not connected, and throws on transport faults — these are connection
problems, not bad tag paths. They get misclassified as unresolved tags and retried on
the 10 s tag-resolution timer instead of triggering the reconnection state machine.
Worse, the design doc (Tag Path Resolution, step 2) says the failed tag's attribute
must be marked quality `bad`; the code never pushes a bad-quality update to the
subscriber for a tag that fails to resolve at subscribe time, so the Instance Actor
stays at uncertain quality with no signal. The `TagResolutionFailed` message it sends
to `Self` only logs and re-arms the timer (`HandleTagResolutionFailed`).
**Recommendation**
Distinguish connection-level exceptions (raise `AdapterDisconnected` / let the
reconnect machine handle them) from genuine node-not-found errors. For genuine
resolution failures, push a `TagValueUpdate` with `QualityCode.Bad` to the subscribing
Instance Actor so it reflects the documented behaviour.
**Resolution**
_Unresolved._
### DataConnectionLayer-005 — `WriteTimeout` option is documented and configured but never applied
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/DataConnectionOptions.cs:15`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:573-590` |
**Description**
`DataConnectionOptions.WriteTimeout` (default 30 s) and the design doc's "Shared
Settings" table both promise a bounded timeout for synchronous device writes. The
value is never read anywhere in the module (`grep` confirms only the declaration).
`HandleWrite` calls `_adapter.WriteAsync(request.TagPath, request.Value)` with no
`CancellationToken` and no timeout. If the OPC UA server hangs (TCP black-hole, no
RST), the write `Task` never completes, `PipeTo(sender)` never fires, and the calling
script's Ask blocks until its own ask-timeout — and the script gets no DCL-level
error. The design states write failures (including timeout) must be returned
synchronously to the script; an unbounded write violates that.
**Recommendation**
Create a `CancellationTokenSource(_options.WriteTimeout)`, pass its token to
`WriteAsync`, and in the continuation translate cancellation into a failed
`WriteTagResponse` with a timeout error message. Apply the same to the read used by
the initial-value seed and to `WriteBatchAndWaitAsync` paths if they are reachable.
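
A minimal sketch of a bounded write, assuming the adapter write can be expressed as a `Func<CancellationToken, Task>`. `BoundedWrite` and `WriteOutcome` are hypothetical names; the real `WriteTagResponse` shape is not shown in this review. `Task.WaitAsync` (.NET 6+) is used so the wait is bounded even if the adapter ignores the token.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical result type standing in for a write response.
public sealed record WriteOutcome(bool Success, string? Error);

public static class BoundedWrite
{
    public static async Task<WriteOutcome> WriteWithTimeoutAsync(
        Func<CancellationToken, Task> write, TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource(timeout);
        try
        {
            // WaitAsync stops waiting when the token fires, even if the write task never completes.
            await write(cts.Token).WaitAsync(cts.Token);
            return new WriteOutcome(true, null);
        }
        catch (OperationCanceledException) when (cts.IsCancellationRequested)
        {
            return new WriteOutcome(false, $"Write timed out after {timeout.TotalSeconds:0.#} s");
        }
        catch (Exception ex)
        {
            return new WriteOutcome(false, ex.Message);
        }
    }
}
```

Because the outcome is always a completed value rather than a hung task, a `PipeTo(sender)` on the result would fire within the configured timeout.
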
**Resolution**
_Unresolved._
### DataConnectionLayer-006 — Health quality counters not reset/recomputed after failover or re-subscribe
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:645-673,721-756` |
**Description**
`ReSubscribeAll` resets `_subscriptionIds`, `_unresolvedTags` and `_resolvedTags` to a
clean slate, but leaves `_lastTagQuality`, `_tagsGoodQuality`, `_tagsBadQuality` and
`_tagsUncertainQuality` untouched. `PushBadQualityForAllTags` (called on disconnect)
sets `_tagsBadQuality = _lastTagQuality.Count` and zeroes the others. After a
reconnect, `HandleTagValueReceived` decrements the *old* bucket using
`_lastTagQuality`'s value and increments the new one — but tags resolved for the first
time after reconnect were never in `_lastTagQuality`, so they only increment, never
decrement, and the totals can drift above `_totalSubscribed`. Over repeated
disconnect/reconnect cycles the health report's good/bad/uncertain counts become
unreliable.
**Recommendation**
On `BecomeConnected` after a re-subscribe (or in `ReSubscribeAll`), clear
`_lastTagQuality` and the three quality counters and let them be repopulated from
fresh `TagValueReceived` messages. Alternatively recompute the buckets from
`_lastTagQuality` whenever it changes rather than maintaining incremental counters.
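
The recompute-on-demand alternative can be sketched as follows. `QualityTracker` and the `Quality` enum are illustrative names, and the sketch assumes single-threaded access (as on an actor thread), so it takes no locks.

```csharp
using System.Collections.Generic;

public enum Quality { Good, Bad, Uncertain }

// Hypothetical helper: the per-tag map is the only state, so counts cannot drift.
public sealed class QualityTracker
{
    private readonly Dictionary<string, Quality> _lastTagQuality = new();

    public void Record(string tagPath, Quality quality) => _lastTagQuality[tagPath] = quality;

    public void Forget(string tagPath) => _lastTagQuality.Remove(tagPath);

    // Call on re-subscribe so stale entries do not survive a reconnect.
    public void Reset() => _lastTagQuality.Clear();

    // Derives the three buckets from the map instead of maintaining counters.
    public (int Good, int Bad, int Uncertain) Snapshot()
    {
        int good = 0, bad = 0, uncertain = 0;
        foreach (var quality in _lastTagQuality.Values)
        {
            switch (quality)
            {
                case Quality.Good: good++; break;
                case Quality.Bad: bad++; break;
                default: uncertain++; break;
            }
        }
        return (good, bad, uncertain);
    }
}
```

The snapshot costs O(tags) per health report, which trades a small amount of CPU for totals that always sum to the number of tracked tags.
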
**Resolution**
_Unresolved._
### DataConnectionLayer-007 — `ReadBatchAsync` aborts the whole batch on the first failing tag
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:187-195` |
**Description**
`ReadBatchAsync` loops calling `ReadAsync` per tag. `ReadAsync` re-throws any
non-cancellation exception (line 184). So if any single tag in the batch throws (bad
node, transient fault), the entire `ReadBatchAsync` throws and the caller gets no
results for the tags that *did* read successfully — even though `ReadResult` already
has a `Success`/`ErrorMessage` shape designed to carry per-tag failures. The batch is
also fully serial (one round-trip per tag), defeating the point of a batch API; the
design doc lists `ReadBatch`/`WriteBatch` as first-class operations.
**Recommendation**
Catch per-tag exceptions inside the loop and store a failed `ReadResult` for that tag
so the batch returns a complete map. Ideally issue a single OPC UA `Read` service call
for all node IDs (`RealOpcUaClient.ReadValueAsync` already builds a
`ReadValueIdCollection` — extend it to accept multiple nodes).
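
The per-tag catch might look like this sketch. `TagReadResult` and `BatchReader` are hypothetical stand-ins (the real `ReadResult` type is only described, not shown, in this review), and the single-tag read is passed in as a delegate. The loop is still serial; a single multi-node `Read` call would be a separate change.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical per-tag result with a Success/ErrorMessage shape.
public sealed record TagReadResult(bool Success, object? Value, string? ErrorMessage);

public static class BatchReader
{
    public static async Task<IReadOnlyDictionary<string, TagReadResult>> ReadBatchAsync(
        IEnumerable<string> tagPaths,
        Func<string, CancellationToken, Task<object?>> readOne,
        CancellationToken ct = default)
    {
        var results = new Dictionary<string, TagReadResult>();
        foreach (var tagPath in tagPaths)
        {
            try
            {
                var value = await readOne(tagPath, ct);
                results[tagPath] = new TagReadResult(true, value, null);
            }
            catch (OperationCanceledException)
            {
                throw; // cancellation still aborts the whole batch
            }
            catch (Exception ex)
            {
                // One bad tag becomes a failed entry instead of failing the batch.
                results[tagPath] = new TagReadResult(false, null, ex.Message);
            }
        }
        return results;
    }
}
```
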
**Resolution**
_Unresolved._
### DataConnectionLayer-008 — `HandleUnsubscribe` is O(tags × instances) and rechecks `_unresolvedTags` redundantly
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:540-569` |
**Description**
For each tag of the instance being removed, `HandleUnsubscribe` scans every other
instance's tag set (`_subscriptionsByInstance.Where(...).Any()`), making the operation
O(tags × instances). On a site with many instances sharing a connection this is
needlessly expensive on every instance stop/redeploy. Separately, line 562
re-evaluates `!_unresolvedTags.Contains(tagPath)` immediately after line 561 already
removed `tagPath` from `_unresolvedTags`, so the condition is always true — dead
logic that obscures intent (the decrement of `_resolvedTags` is unconditional in
practice).
**Recommendation**
Maintain a reference count per tag path (or a `tagPath -> set<instance>` reverse index)
so the "any other subscriber" check is O(1). Remove the redundant `_unresolvedTags`
re-check or restructure so the resolved/unresolved decrement reflects the tag's actual
prior state captured before removal.
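
One possible shape for the reverse index, assuming instances and tag paths are both identified by strings. `TagSubscriberIndex` is a hypothetical name, and the sketch assumes actor-thread (single-threaded) access.

```csharp
using System.Collections.Generic;

// Hypothetical reverse index: tag path -> instances currently subscribed to it.
public sealed class TagSubscriberIndex
{
    private readonly Dictionary<string, HashSet<string>> _instancesByTag = new();

    public void Add(string instance, string tagPath)
    {
        if (!_instancesByTag.TryGetValue(tagPath, out var instances))
        {
            instances = new HashSet<string>();
            _instancesByTag[tagPath] = instances;
        }
        instances.Add(instance);
    }

    // Returns true only when this instance was the last subscriber,
    // i.e. when the adapter-level subscription can be released.
    public bool Remove(string instance, string tagPath)
    {
        if (!_instancesByTag.TryGetValue(tagPath, out var instances)) return false;
        if (!instances.Remove(instance)) return false;
        if (instances.Count > 0) return false;
        _instancesByTag.Remove(tagPath);
        return true;
    }
}
```

With this index the "any other subscriber" check is a hash lookup per tag, so removing an instance costs O(tags) rather than O(tags × instances).
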
**Resolution**
_Unresolved._
### DataConnectionLayer-009 — Implemented failover heuristic diverges from the documented state machine
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:189,242-297,379-449`, `docs/requirements/Component-DataConnectionLayer.md:73-85` |
**Description**
The design doc's failover state machine reads "retry active endpoint (5s) -> N failures
(>= FailoverRetryCount) -> switch to other endpoint". The code implements two *separate*
failover triggers: (a) `HandleReconnectResult` counts `_consecutiveFailures` on
connect-attempt failures (matches the doc), and (b) `BecomeReconnecting` additionally
counts `_consecutiveUnstableDisconnects` — connections that succeeded but dropped
within a hard-coded 60 s `StableConnectionThreshold` — and fails over on that count
too. The unstable-disconnect path, the 60 s threshold, and the fact that failover can
happen on *successful-but-flaky* connections are not described in the component doc at
all. A reviewer or operator reading `Component-DataConnectionLayer.md` would not
predict this behaviour, and the 60 s threshold is a magic constant not exposed via
`DataConnectionOptions`.
**Recommendation**
Update `Component-DataConnectionLayer.md` to document the unstable-disconnect failover
path and the stability threshold, and move the 60 s threshold into
`DataConnectionOptions` so it is configurable and consistent with the other tunables.
**Resolution**
_Unresolved._
### DataConnectionLayer-010 — Tag-resolution retry can issue duplicate concurrent subscribe attempts
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:594-619,689-703` |
**Description**
`HandleRetryTagResolution` fires `SubscribeAsync` for every tag in `_unresolvedTags`
via `ContinueWith(...).PipeTo(self)`, but does **not** remove the tags from
`_unresolvedTags` while the attempts are in flight. A slow `SubscribeAsync` that
overlaps the next 10 s tick therefore triggers a second, concurrent subscribe
attempt for the same tag, which can create duplicate
monitored items / leaked subscription IDs (the second success overwrites
`_subscriptionIds[tag]` in `HandleTagResolutionSucceeded`, orphaning the first handle
with no `UnsubscribeAsync` call). The timer-cancel condition in
`HandleTagResolutionSucceeded` is also non-deterministic for the same reason.
**Recommendation**
Remove tags from `_unresolvedTags` (into an "in-flight" set) when a retry is
dispatched, and only put them back on failure. This prevents overlapping duplicate
subscribe attempts and makes the timer-cancel condition deterministic.
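
The in-flight bookkeeping could be sketched as below. `RetryTracker` and its member names are illustrative, and the sketch assumes all calls happen on the actor thread.

```csharp
using System.Collections.Generic;

// Hypothetical bookkeeping for tag-resolution retries.
public sealed class RetryTracker
{
    private readonly HashSet<string> _unresolved = new();
    private readonly HashSet<string> _inFlight = new();

    public void MarkUnresolved(string tagPath)
    {
        // A tag that is already being retried is not queued a second time.
        if (!_inFlight.Contains(tagPath)) _unresolved.Add(tagPath);
    }

    // Timer tick: moves every pending tag to the in-flight set and returns the batch to subscribe.
    public IReadOnlyList<string> BeginRetry()
    {
        var batch = new List<string>(_unresolved);
        _unresolved.Clear();
        _inFlight.UnionWith(batch);
        return batch;
    }

    public void Succeeded(string tagPath) => _inFlight.Remove(tagPath);

    // Only a failed attempt puts the tag back for the next tick.
    public void Failed(string tagPath)
    {
        if (_inFlight.Remove(tagPath)) _unresolved.Add(tagPath);
    }

    // The retry timer is needed while anything is pending or still in flight.
    public bool TimerNeeded => _unresolved.Count > 0 || _inFlight.Count > 0;
}
```
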
**Resolution**
_Unresolved._
### DataConnectionLayer-011 — Stale subscription callbacks from disposed adapters can still reach the actor
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:486-489,278-285,416-425`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:252-262` |
**Description**
On failover the actor disposes the old adapter (`_adapter.DisposeAsync()`,
fire-and-forget) and creates a fresh one. The old adapter's subscription callbacks
captured `self` and `tagPath` and `Tell` `TagValueReceived` to the actor. While the
`Reconnecting` handler ignores `TagValueReceived` (line 334), once the actor reaches
`Connected` again it processes them — and a disposed adapter whose OPC UA SDK threads
have not yet fully torn down could still deliver a value, mixing pre-failover device
data with the new endpoint's data and briefly reporting a value the active endpoint
never produced. There is no per-adapter generation/epoch tag on `TagValueReceived` to
distinguish current from stale callbacks.
**Recommendation**
Add an adapter-generation counter incremented on every adapter swap; stamp it onto
`TagValueReceived` (captured in the callback closure) and drop messages whose
generation does not match the current adapter in `HandleTagValueReceived`.
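
A sketch of the generation stamp, under the assumption that `TagValueReceived` can carry an extra `int`. The record shape and `AdapterGenerationGate` are hypothetical and only illustrate the closure capture.

```csharp
#nullable enable
using System;

// Hypothetical message shape with an added generation stamp.
public sealed record TagValueReceived(string TagPath, object? Value, int Generation);

public sealed class AdapterGenerationGate
{
    private int _generation;

    // Call on every adapter swap. The returned factory captures the new generation,
    // so callbacks created for this adapter keep stamping that value forever.
    public Func<string, object?, TagValueReceived> BeginNewAdapter()
    {
        var captured = ++_generation;
        return (tagPath, value) => new TagValueReceived(tagPath, value, captured);
    }

    // Messages stamped by a previous adapter no longer match and can be dropped.
    public bool IsCurrent(TagValueReceived message) => message.Generation == _generation;
}
```
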
**Resolution**
_Unresolved._
### DataConnectionLayer-012 — `AutoAcceptUntrustedCerts` defaults to `true`, accepting any server certificate
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/IOpcUaClient.cs:17`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:49,60-61`, `docs/requirements/Component-DataConnectionLayer.md:116` |
**Description**
`OpcUaConnectionOptions.AutoAcceptUntrustedCerts` defaults to `true`, and
`RealOpcUaClient.ConnectAsync` wires `CertificateValidator.CertificateValidation += (_, e) => e.Accept = true`
when it is set. With the default, every server certificate is accepted unconditionally
— there is no certificate-pinning or trust-store enforcement — which defeats the
`Sign`/`SignAndEncrypt` security modes against an active man-in-the-middle on the OPC
UA link. The design doc explicitly lists `true` as the default. For an industrial
control link this is a meaningful exposure; a secure-by-default posture would reject
untrusted certs unless an operator opts in per connection.
**Recommendation**
Default `AutoAcceptUntrustedCerts` to `false` and require explicit per-connection
opt-in, or at minimum log a prominent warning whenever the auto-accept validator is
installed. Update the design doc to reflect the secure default.
**Resolution**
_Unresolved._
### DataConnectionLayer-013 — Misleading XML comment: `RaiseDisconnected` claims thread safety it does not provide
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:270-281` |
**Description**
The XML doc on `RaiseDisconnected` states "Thread-safe: only the first caller triggers
the event." The implementation is a non-atomic check-then-set on a `volatile bool`
(`if (_disconnectFired) return; _disconnectFired = true;`). `volatile` guarantees
visibility, not atomicity — two threads (e.g. the OPC UA keep-alive thread via
`OnClientConnectionLost` and a `ReadAsync` failure path) can both observe
`_disconnectFired == false` and both invoke `Disconnected`. In practice the
`DataConnectionActor` tolerates a duplicate `AdapterDisconnected` message, so impact
is low, but the comment overstates the guarantee. The same pattern exists in
`RealOpcUaClient.OnSessionKeepAlive` (`_connectionLostFired`).
**Recommendation**
Either make the guard atomic (`Interlocked.Exchange` with an `int` flag, or a lock),
or correct the comment to say "best-effort once-only; a duplicate event is possible
under a race and is tolerated downstream."
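
The atomic variant of the guard is small. The sketch below uses a hypothetical `OnceOnlyDisconnect` class and an `Action` event purely for illustration; the `Reset` method is an assumption about how the flag would be re-armed after a reconnect.

```csharp
#nullable enable
using System;
using System.Threading;

public sealed class OnceOnlyDisconnect
{
    private int _fired; // 0 = not fired, 1 = fired

    public event Action? Disconnected;

    public void RaiseDisconnected()
    {
        // Exchange is atomic: exactly one caller observes the previous value 0.
        if (Interlocked.Exchange(ref _fired, 1) != 0) return;
        Disconnected?.Invoke();
    }

    // Re-arm after a successful reconnect.
    public void Reset() => Interlocked.Exchange(ref _fired, 0);
}
```
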
**Resolution**
_Unresolved._

@@ -0,0 +1,493 @@
# Code Review — DeploymentManager
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.DeploymentManager` |
| Design doc | `docs/requirements/Component-DeploymentManager.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 14 |
## Summary
The DeploymentManager module is small, well-structured, and clearly maps work
packages (WP-N) onto code. The happy paths for instance deployment, lifecycle
commands, artifact broadcast, and staleness comparison are implemented
sensibly, and the operation lock correctly serializes mutating operations per
instance while allowing cross-instance parallelism. However, the review found a
significant cluster of error-handling and resilience gaps: the deployment
record can be left permanently stuck in `InProgress` when an exception other
than timeout/cancellation is thrown, the catch block writes its failure status
using a cancellation token that may already be cancelled, and the
`OperationLockManager` leaks one `SemaphoreSlim` per instance name forever.
There are also two notable design-document adherence gaps: the
"query-the-site-before-redeploy" idempotency requirement is not implemented
(`GetDeploymentStatusAsync` only reads the local DB), and the "Diff View"
feature is reduced to a bare hash comparison with no added/removed/changed
detail. `DeploymentManagerOptions` is never bound to configuration, so its
timeouts cannot be tuned via `appsettings.json`, and one option
(`LifecycleCommandTimeout`) is never read at all. Test coverage stops at the communication boundary and never
exercises a successful deployment or the lifecycle success paths.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Stuck `InProgress` record on unexpected exception; cancelled-token failure write. |
| 2 | Akka.NET conventions | ✓ | Module is a plain service layer; it calls `CommunicationService` which wraps Ask. No actors here. No issues. |
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` is sound but leaks semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. |
| 4 | Error handling & resilience | ✓ | Several gaps — see DeploymentManager-001/002/003/004. |
| 5 | Security | ✓ | SMTP credentials are serialized and broadcast to sites — see DeploymentManager-013. No injection vectors; no authz here (enforced upstream). |
| 6 | Performance & resource management | ✓ | Semaphore leak (DeploymentManager-005); artifact rebuild does N+1 method queries per external system. |
| 7 | Design-document adherence | ✓ | Missing query-before-redeploy (DeploymentManager-006); Diff View not implemented (DeploymentManager-007). |
| 8 | Code organization & conventions | ✓ | Options class not bound to configuration — DeploymentManager-008. POCO/repo placement correct. |
| 9 | Testing coverage | ✓ | No successful-deploy test, no lifecycle success test — DeploymentManager-011; dead `CreateCommand` helper — DeploymentManager-014. |
| 10 | Documentation & comments | ✓ | Misleading timeout comment — DeploymentManager-009; stale option XML doc — DeploymentManager-012. |
## Findings
### DeploymentManager-001 — Unexpected exceptions leave the deployment record stuck in `InProgress`
| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:141-199` |
**Description**
`DeployInstanceAsync` sets the record to `InProgress` (lines 137-139), then the
`try` block calls into `CommunicationService` and the repository. The only
`catch` filter is `when (ex is TimeoutException or OperationCanceledException)`.
Any other exception — `InvalidOperationException` (thrown by
`CommunicationService.GetCommunicationActor()` when the actor is not set), a
JSON serialization error, a deserialization failure of the response, a DB
exception on `UpdateDeploymentRecordAsync`, or any transport error — escapes the
method. The deployment record remains in `DeploymentStatus.InProgress`
permanently. Because staleness and the UI both read current status, the
instance is then misreported as "deploying" forever and a re-deploy may be
blocked or misinterpreted. The design explicitly states an interrupted
deployment must be "treated as failed".
**Recommendation**
Broaden the catch to a general `catch (Exception ex)` that records
`DeploymentStatus.Failed` with the error message, audit-logs the failure, and
re-throws or returns a failed `Result`. Keep the timeout-specific branch only
if a distinct message is desired. Ensure the failure-status write happens for
every exit path out of the `try`.
**Resolution**
_Unresolved._
### DeploymentManager-002 — Failure-status write uses a possibly-cancelled cancellation token
| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:186-196` |
**Description**
The `catch (Exception ex) when (ex is TimeoutException or
OperationCanceledException)` block updates the record to `Failed` and calls
`UpdateDeploymentRecordAsync`/`SaveChangesAsync`/`LogAsync` passing the same
`cancellationToken` that the failed operation used. When the
`OperationCanceledException` came from that token, it is already in the cancelled
state, so the repository and audit calls will themselves throw
`OperationCanceledException` before the failure status is persisted. The record
then stays `InProgress`: the exact bug DeploymentManager-001 describes, reached
via the supposedly-handled path.
**Recommendation**
Perform the cleanup writes with a fresh, non-cancellable token (e.g.
`CancellationToken.None`, optionally with an independent short timeout) so the
failure status is durably recorded even when the original operation was
cancelled or timed out.
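
A hedged sketch of the cleanup pattern, combined with the broader catch suggested in DeploymentManager-001. The site call and the status write are modelled as delegates, and `DeploymentFailureRecorder`, `cleanupTimeout`, and the local `DeploymentStatus` enum are assumptions for illustration rather than the service's real signatures.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public enum DeploymentStatus { InProgress, Success, Failed }

public static class DeploymentFailureRecorder
{
    public static async Task<DeploymentStatus> RunAsync(
        Func<CancellationToken, Task> sendToSite,
        Func<DeploymentStatus, string?, CancellationToken, Task> persistStatus,
        TimeSpan cleanupTimeout,
        CancellationToken cancellationToken)
    {
        try
        {
            await sendToSite(cancellationToken);
        }
        catch (Exception ex)
        {
            // Independent token: the caller's token may already be cancelled,
            // but the failure status still has to be written.
            using var cleanup = new CancellationTokenSource(cleanupTimeout);
            await persistStatus(DeploymentStatus.Failed, ex.Message, cleanup.Token);
            return DeploymentStatus.Failed;
        }

        // Post-success persistence is a separate concern (see DeploymentManager-003).
        await persistStatus(DeploymentStatus.Success, null, cancellationToken);
        return DeploymentStatus.Success;
    }
}
```

The independent timeout keeps the cleanup write from hanging forever while still decoupling it from the cancelled caller token.
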
**Resolution**
_Unresolved._
### DeploymentManager-003 — Successful-deployment cleanup is not atomic with the status write
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:155-170` |
**Description**
After a successful site response the code calls `UpdateDeploymentRecordAsync`
(no `SaveChanges` yet), then `UpdateInstanceAsync`, then
`StoreDeployedSnapshotAsync` (which itself issues `Add`/`Update` calls), then a
single `SaveChangesAsync` at line 170. If `StoreDeployedSnapshotAsync` throws,
the exception is not caught (see DeploymentManager-001) and the
`SaveChangesAsync` never runs — the instance state, deployment status, and
snapshot are all left unpersisted even though the site has actually applied the
deployment. Central and site are now divergent: the site is running the new
config but central still shows the old state and a non-`Success` deployment
record.
**Recommendation**
Wrap the post-success persistence so that, at minimum, the deployment record's
`Success` status is committed. Consider committing the status first, then the
instance state and snapshot, so a later failure does not lose the fact that the
site succeeded. Log loudly if the snapshot write fails after a confirmed site
apply.
**Resolution**
_Unresolved._
### DeploymentManager-004 — Site-success but central-delete-failure leaves orphaned site config
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:312-319` |
**Description**
In `DeleteInstanceAsync`, when the site responds `Success` the code calls
`_repository.DeleteInstanceAsync` then `SaveChangesAsync`. If `SaveChangesAsync`
throws (DB error, concurrency), the exception propagates uncaught: the site has
already destroyed the Instance Actor and removed its config, but the central
instance record still exists. The instance is now un-deletable through the
normal path (the site no longer has it, so a re-issued delete may fail) and is
permanently orphaned. The design states central must not mark the instance
deleted until the site confirms — but it does not address the inverse failure.
**Recommendation**
Catch persistence failures in the post-success block and surface a distinct
error indicating the site succeeded but the central record could not be
removed, so an operator/retry can reconcile. Consider making the central delete
idempotent and retryable independently of the site command.
**Resolution**
_Unresolved._
### DeploymentManager-005 — `OperationLockManager` leaks a `SemaphoreSlim` per instance name
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/OperationLockManager.cs:15-33` |
**Description**
`AcquireAsync` does `_locks.GetOrAdd(instanceUniqueName, _ => new
SemaphoreSlim(1, 1))` and entries are never removed. Every distinct instance
unique name that is ever deployed/disabled/enabled/deleted permanently adds a
`SemaphoreSlim` to the dictionary. `SemaphoreSlim` is `IDisposable`, although it
only allocates a kernel wait handle if `AvailableWaitHandle` is accessed, so the
leak is primarily managed memory. Over the lifetime of a long-running central
process (especially with the bulk "deploy all out-of-date instances" workflow
and instances that are created and deleted over time) the growth is unbounded.
Deleted instances' semaphores are never reclaimed.
**Recommendation**
Either accept the leak explicitly and document the expected bounded cardinality
of instance names, or implement reclamation: e.g. ref-count handles, then remove
and `Dispose()` the semaphore when the count reaches zero and the lock is free.
At minimum, remove the semaphore entry when an instance is deleted
(`DeleteInstanceAsync`).
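
The ref-counting option could be sketched as follows. `RefCountedLockManager` is a hypothetical replacement, not the module's `OperationLockManager`; it trades the lock-free `ConcurrentDictionary` for a plain dictionary guarded by a short `lock`, which keeps the count and the removal atomic with each other.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class RefCountedLockManager
{
    private sealed class Entry
    {
        public readonly SemaphoreSlim Semaphore = new(1, 1);
        public int RefCount; // holder plus waiters; guarded by lock (_entries)
    }

    private readonly Dictionary<string, Entry> _entries = new();

    // Number of instance names currently tracked (for diagnostics and tests).
    public int TrackedCount
    {
        get { lock (_entries) { return _entries.Count; } }
    }

    public async Task<IDisposable> AcquireAsync(string instanceName, CancellationToken ct = default)
    {
        Entry entry;
        lock (_entries)
        {
            if (!_entries.TryGetValue(instanceName, out var existing))
            {
                existing = new Entry();
                _entries[instanceName] = existing;
            }
            existing.RefCount++;
            entry = existing;
        }

        try
        {
            await entry.Semaphore.WaitAsync(ct);
        }
        catch
        {
            Release(instanceName, entry, held: false); // never acquired: only drop the reference
            throw;
        }
        return new Releaser(this, instanceName, entry);
    }

    private void Release(string instanceName, Entry entry, bool held)
    {
        if (held) entry.Semaphore.Release();
        lock (_entries)
        {
            if (--entry.RefCount > 0) return;
            _entries.Remove(instanceName); // last user: reclaim the entry
            entry.Semaphore.Dispose();
        }
    }

    private sealed class Releaser : IDisposable
    {
        private readonly RefCountedLockManager _owner;
        private readonly string _instanceName;
        private readonly Entry _entry;
        private int _disposed;

        public Releaser(RefCountedLockManager owner, string instanceName, Entry entry)
        {
            _owner = owner;
            _instanceName = instanceName;
            _entry = entry;
        }

        public void Dispose()
        {
            // Guard against a double dispose releasing the semaphore twice.
            if (Interlocked.Exchange(ref _disposed, 1) == 0)
                _owner.Release(_instanceName, _entry, held: true);
        }
    }
}
```

An entry exists only while some caller holds or waits for the lock, so the dictionary size tracks concurrent operations rather than the number of instance names ever seen.
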
**Resolution**
_Unresolved._
### DeploymentManager-006 — Query-the-site-before-redeploy idempotency requirement not implemented
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:84-200,363-368` |
**Description**
The design ("Deployment Identity & Idempotency") requires: "After a central
failover or timeout, the Deployment Manager queries the site for current
deployment state before allowing a re-deploy. This prevents duplicate
application and out-of-order config changes." The code never does this.
`GetDeploymentStatusAsync` only reads the local `DeploymentRecord` from the DB
(`GetDeploymentByDeploymentIdAsync`) — it does not contact the site.
`DeployInstanceAsync` unconditionally generates a new deployment ID and sends a
new `DeployInstanceCommand` regardless of any prior in-flight or timed-out
deployment. After a timeout where the site actually applied the config, a
re-deploy produces a second deployment with no reconciliation against the
site's current revision hash. Site-side stale-rejection is the only safety
net, and that is not verified here.
**Recommendation**
Add a site query (a new `CommunicationService` pattern returning the site's
currently-applied deployment ID / revision hash) and call it before re-deploy
when a prior record for the instance is in `InProgress`/`Failed` due to
timeout. Reconcile: if the site already has the target revision, mark the prior
record `Success` instead of re-sending. Either implement this or update the
design doc to reflect that reconciliation is delegated entirely to site-side
stale-rejection.
**Resolution**
_Unresolved._
### DeploymentManager-007 — "Diff View" reduced to a hash comparison with no diff detail
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:334-358,401-406` |
**Description**
The design ("Diff View" and "Dependencies" sections) states the Deployment
Manager can request a diff from the Template Engine showing added/removed
members, changed values, and connection-binding changes.
`GetDeploymentComparisonAsync` and `DeploymentComparisonResult` only compare two
revision hashes and return a boolean `IsStale` plus the two hashes. No
added/removed/changed detail is produced, and the Template Engine's diff
capability is not invoked. The UI cannot render a meaningful diff from this
result.
**Recommendation**
Either implement a real diff (deserialize the stored
`DeployedConfigSnapshot.ConfigurationJson` and the freshly flattened config and
invoke the Template Engine's diff service, surfacing structured
added/removed/changed entries), or revise the design doc to scope the feature
down to staleness detection only.
**Resolution**
_Unresolved._
### DeploymentManager-008 — `DeploymentManagerOptions` is never bound to configuration
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/ServiceCollectionExtensions.cs:7-14` |
**Description**
`AddDeploymentManager` registers the services but never calls
`services.Configure<DeploymentManagerOptions>(configuration.GetSection(...))`.
`IOptions<DeploymentManagerOptions>` therefore always resolves to a
default-constructed instance — the operation-lock and artifact-deployment
timeouts cannot be tuned via `appsettings.json`, contrary to the CLAUDE.md
convention "Per-component configuration via `appsettings.json` sections bound
to options classes (Options pattern)." `Host/Program.cs` binds
`SecurityOptions` and `InboundApiOptions` from configuration sections but has
no equivalent for `DeploymentManagerOptions`.
**Recommendation**
Add an `IConfiguration` parameter (or a configure callback) to
`AddDeploymentManager` and bind `DeploymentManagerOptions` to a section such as
`ScadaLink:DeploymentManager`, consistent with the other components.
**Resolution**
_Unresolved._
### DeploymentManager-009 — Misleading timeout comment on `DeleteInstanceAsync`
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:288` |
**Description**
The XML doc says "Delete fails if site unreachable (30s timeout via
CommunicationOptions)." The actual delete timeout is whatever
`CommunicationOptions.LifecycleTimeout` is configured to (passed inside
`CommunicationService.DeleteInstanceAsync`); the "30s" figure is hard-coded
into the comment and not derived from any constant in this module. If
`LifecycleTimeout` is reconfigured, the comment becomes wrong. It also wrongly
implies the value lives in this module.
**Recommendation**
Reword to "Delete fails if the site is unreachable within
`CommunicationOptions.LifecycleTimeout`" without quoting a specific number.
**Resolution**
_Unresolved._
### DeploymentManager-010 — `SystemArtifactDeploymentRecord` does not persist the deployment ID
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:136,194-211` |
**Description**
`DeployToAllSitesAsync` generates a `deploymentId` (line 136) and returns it in
the `ArtifactDeploymentSummary` and audit log, but the persisted
`SystemArtifactDeploymentRecord` has no field for it (the entity only has `Id`,
`ArtifactType`, `DeployedBy`, `DeployedAt`, `PerSiteStatus`). The deployment ID
that appears in the UI summary and audit log cannot be correlated back to the
stored record. Additionally each per-site `DeployArtifactsCommand` carries its
own separate GUID (`BuildDeployArtifactsCommandAsync` line 114), so there are in
fact N+1 unrelated IDs for one logical artifact deployment.
**Recommendation**
Add a `DeploymentId` column to `SystemArtifactDeploymentRecord` and store the
single logical `deploymentId`; reuse that ID (or a derived per-site ID) for the
per-site commands so the audit log, UI summary, and persisted record agree.
**Resolution**
_Unresolved._
### DeploymentManager-011 — Tests never exercise a successful deployment or lifecycle success path
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.DeploymentManager.Tests/DeploymentServiceTests.cs:100-151,155-199` |
**Description**
`DeploymentServiceTests` never sets the `CommunicationService` actor, so every
deploy/lifecycle test deliberately stops at the `InvalidOperationException`
thrown by `GetCommunicationActor()` (see lines 118-125, 147). As a result there
is no test covering: a successful deployment (`DeploymentStatus.Success`
response → instance state set to `Enabled`, snapshot stored, audit logged); a
failed-but-handled site response; the `InProgress`-stuck bug
(DeploymentManager-001); successful Disable/Enable/Delete; or the operation
lock actually serializing two concurrent deploys of the same instance. The
critical post-response branch (`DeploymentService.cs:154-184`) and the entire
delete/disable/enable success path are untested. The `AuditLogs` test
(lines 277-289) asserts nothing.
**Recommendation**
Introduce a seam to inject a fake/substitute communication path (e.g. an
interface over `CommunicationService`, or wire a TestKit actor) so success and
handled-failure paths can be unit tested. Add tests for the stuck-`InProgress`
scenario and for per-instance lock contention during deploy. Make the audit
test assert on `IAuditService.LogAsync`.
**Resolution**
_Unresolved._
### DeploymentManager-012 — `LifecycleCommandTimeout` option is dead code
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/DeploymentManagerOptions.cs:8-9` |
**Description**
`DeploymentManagerOptions.LifecycleCommandTimeout` is declared with a 30s
default and an XML doc, but it is never read anywhere in the codebase
(lifecycle commands rely on `CommunicationOptions.LifecycleTimeout` inside
`CommunicationService`). The option misleads readers into thinking it controls
disable/enable/delete timeouts, when setting it has no effect.
**Recommendation**
Remove `LifecycleCommandTimeout`, or actually thread it through to the
lifecycle command calls (e.g. by creating a linked CTS with this timeout in
`DisableInstanceAsync`/`EnableInstanceAsync`/`DeleteInstanceAsync`, the way
`ArtifactDeploymentTimeoutPerSite` is used).
**Resolution**
_Unresolved._
### DeploymentManager-013 — SMTP credentials serialized and broadcast to all sites
| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.DeploymentManager/ArtifactDeploymentService.cs:108-111` |
**Description**
`BuildDeployArtifactsCommandAsync` maps `smtp.Credentials` directly into
`SmtpConfigurationArtifact` and that command is sent to every site. Distributing
SMTP credentials to sites is consistent with the design (SMTP configuration is
a deployable artifact), but the credentials travel inside a serialized command
across the inter-cluster transport and are stored on each site's SQLite. There
is no indication the value is encrypted at rest on the site or scrubbed from
logs. It is worth confirming that the transport is TLS-protected and that the site
stores the credential securely; at minimum this should be a conscious, documented decision.
**Recommendation**
Confirm inter-cluster transport encryption covers artifact commands, ensure
`Credentials` is never written to logs, and document the at-rest protection of
SMTP credentials on site SQLite. Consider encrypting the credential field
within the artifact payload.
**Resolution**
_Unresolved._
### DeploymentManager-014 — Dead `CreateCommand` helper in artifact tests
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.DeploymentManager.Tests/ArtifactDeploymentServiceTests.cs:86-90` |
**Description**
The private static `CreateCommand()` helper is never referenced by any test in
the file. It is dead code that suggests an intended test (e.g. a successful
multi-site artifact deployment) was never written — coverage of
`DeployToAllSitesAsync` is limited to the no-sites failure case, and
`RetryForSiteAsync` and `BuildDeployArtifactsCommandAsync` have no tests at all.
**Recommendation**
Either remove the unused helper or, preferably, write the missing tests for
`DeployToAllSitesAsync` (per-site success/failure matrix, partial failure) and
`RetryForSiteAsync` using it.
**Resolution**
_Unresolved._

@@ -0,0 +1,512 @@
# Code Review — ExternalSystemGateway
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.ExternalSystemGateway` |
| Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 14 |
## Summary
The External System Gateway is a small module (five source files plus options) that
implements the HTTP/REST client (`ExternalSystemClient`), the database access surface
(`DatabaseGateway`), and error classification (`ErrorClassifier`). The structure is
clean and the dual call-mode semantics broadly match the design doc. However, the
review surfaced several substantive problems that prevent the module from behaving as
designed. The most serious is that **no store-and-forward delivery handler is ever
registered** for the `ExternalSystem` or `CachedDbWrite` categories, so cached calls
and cached writes are buffered but can never actually be delivered on retry — a silent
data-loss path. Two further high-impact issues are that the **per-system call timeout
is never applied** to the HTTP client (the design's central error-handling guarantee
is absent), and that **`CachedCall` double-dispatches the HTTP request** because
`StoreAndForwardService.EnqueueAsync` itself re-attempts immediate delivery, breaking
the idempotency expectations. A cluster of medium issues concern resource leaks,
classification gaps (cancellation conflation), and the dropped `StoreAndForwardResult`.
Test coverage is thin — `CachedCall` transient/buffering paths and `DatabaseGateway`
are entirely untested. Themes: incomplete wiring against the S&F engine, and design-doc
requirements (timeout, retry settings) that are declared but not implemented.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | URL building edge cases, dropped S&F result, classification gaps — findings 003, 006, 009. |
| 2 | Akka.NET conventions | ☑ | No actors in this module; `AddExternalSystemGatewayActors` is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found in this module. |
| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; `ExternalCallResult.Response` lazy-parse is not thread-safe but instances are single-use. No findings raised. |
| 4 | Error handling & resilience | ☑ | S&F handler never registered, double-dispatch, timeout not applied, cancellation conflation — findings 001, 002, 003, 008. |
| 5 | Security | ☑ | Auth secrets logged-safe, but error bodies echoed verbatim — finding 007. |
| 6 | Performance & resource management | ☑ | `HttpRequestMessage`/`HttpResponseMessage` and failed `SqlConnection` not disposed; full repository scan per call — findings 005, 010, 011. |
| 7 | Design-document adherence | ☑ | Timeout, retry settings, audit logging gaps — findings 002, 004, 012. |
| 8 | Code organization & conventions | ☑ | Options class correctly owned by module; `MaxConcurrentConnectionsPerSystem` unused — finding 013. |
| 9 | Testing coverage | ☑ | CachedCall buffering and DatabaseGateway untested — finding 014. |
| 10 | Documentation & comments | ☑ | XML docs reference WP numbers; permanent-failure logging requirement unverified — folded into finding 012. |
## Findings
### ExternalSystemGateway-001 — No S&F delivery handler registered; cached calls and writes can never be delivered
| | |
|--|--|
| Severity | Critical |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:81` |
**Description**
`CachedCallAsync` and `CachedWriteAsync` enqueue messages under
`StoreAndForwardCategory.ExternalSystem` and `StoreAndForwardCategory.CachedDbWrite`.
`StoreAndForwardService.RegisterDeliveryHandler` is the only mechanism that lets the
S&F engine actually deliver a buffered message, and a repository-wide search shows it
is **never called for either category** anywhere in the codebase. Consequences:
1. On a transient failure, `EnqueueAsync` falls through to the "No handler registered
— buffer for later" branch (`StoreAndForwardService.cs:163`) and the message is
persisted.
2. During the retry sweep, `AttemptDeliveryAsync` (`StoreAndForwardService.cs:201`)
logs `"No delivery handler for category {Category}"` and returns without ever
removing or delivering the message.
The result is that every cached external call and cached DB write is silently
buffered forever and never delivered — a data-loss path for the exact "deferred
delivery is acceptable" use cases the design doc calls out (posting production data,
quality reports). The script also receives `WasBuffered: true` / a successful
`CachedWriteAsync` completion, so the failure is completely invisible.
**Recommendation**
Register delivery handlers for `StoreAndForwardCategory.ExternalSystem` and
`StoreAndForwardCategory.CachedDbWrite` during host/site startup. The `ExternalSystem`
handler should deserialize the payload, re-resolve the system/method, and re-invoke
`InvokeHttpAsync`, returning `true`/`false`/throwing per the transient-vs-permanent
contract `EnqueueAsync` expects. The `CachedDbWrite` handler should execute the SQL
against the named connection. Add an integration test that buffers a message and
verifies it is delivered by a retry sweep.
**Resolution**
_Unresolved._
### ExternalSystemGateway-002 — Per-system call timeout is never applied to HTTP requests
| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:130`, `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13` |
**Description**
The design doc states each external system definition specifies a timeout that
"applies to all method calls on that system" and "applies to the HTTP request
round-trip", and `ExternalSystemGatewayOptions.DefaultHttpTimeout` exists as a
fallback. In practice no timeout is ever configured. `ServiceCollectionExtensions`
calls `services.AddHttpClient()` with no per-named-client configuration, and
`InvokeHttpAsync` calls `_httpClientFactory.CreateClient($"ExternalSystem_{system.Name}")`
without setting `client.Timeout` or passing a `CancellationToken` derived from a
timeout. `SendAsync` is therefore subject only to `HttpClient`'s default 100-second
timeout, regardless of the system definition or the configured `DefaultHttpTimeout`.
A slow or hung external system will block the calling Script Execution Actor far
longer than the operator configured, and the design's core error-handling guarantee
(timeout → transient classification) does not hold within the intended window.
There is also no `Timeout` field on `ExternalSystemDefinition` at all, so even a
correct implementation has nowhere to read the per-system value from — the entity is
missing the field the design requires.
**Recommendation**
Add a `Timeout` (TimeSpan) field to `ExternalSystemDefinition` and have
`InvokeHttpAsync` enforce it — either by setting `client.Timeout` via a typed/named
`HttpClient` registration, or by linking a `CancellationTokenSource` with the
per-system (or `DefaultHttpTimeout`) timeout to the supplied `cancellationToken`
before `SendAsync`. Ensure the resulting `TaskCanceledException`/`TimeoutException`
is classified as transient.
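
The linked-token option can be sketched as follows. This is a minimal illustration, not the module's code: `TimeoutHelper`, `RunWithTimeoutAsync`, and the `operation` delegate (a stand-in for the `SendAsync` call) are hypothetical names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TimeoutHelper
{
    // Hypothetical helper: runs `operation` under a deadline linked to the caller's token.
    public static async Task<T> RunWithTimeoutAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        TimeSpan timeout,
        CancellationToken callerToken)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
        cts.CancelAfter(timeout);
        try
        {
            return await operation(cts.Token).ConfigureAwait(false);
        }
        catch (OperationCanceledException ex) when (!callerToken.IsCancellationRequested)
        {
            // The caller did not cancel, so the cancellation is attributed to the deadline.
            throw new TimeoutException(
                $"Operation exceeded {timeout.TotalMilliseconds} ms.", ex);
        }
    }
}
```

Converting the deadline into a `TimeoutException` keeps a caller-initiated cancellation distinguishable from a timeout, which the transient classifier can then rely on.
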
**Resolution**
_Unresolved._
### ExternalSystemGateway-003 — `CachedCall` double-dispatches the HTTP request
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:84-117` |
**Description**
`CachedCallAsync` first calls `InvokeHttpAsync` directly (line 86). On a
`TransientExternalSystemException` it then calls `_storeAndForward.EnqueueAsync(...)`
(line 109). `StoreAndForwardService.EnqueueAsync` is **not** a pure enqueue — it
"Attempts immediate delivery" by invoking the registered delivery handler
(`StoreAndForwardService.cs:128-159`). If a delivery handler for the `ExternalSystem`
category is registered (as finding 001 recommends), the HTTP request will be executed
a **second time** synchronously inside `EnqueueAsync`, immediately after the first
attempt failed. For a transient failure that is actually a slow/overloaded system,
this doubles the load and — critically — if the original request did reach the
external system, the immediate retry produces a duplicate delivery before the script
even returns, worsening the idempotency hazard the design doc explicitly warns about.
**Recommendation**
Decide on one dispatch path. Either (a) have `CachedCall` not pre-invoke
`InvokeHttpAsync` and instead let `EnqueueAsync`'s immediate-delivery attempt be the
single first attempt (requires the handler to exist and to surface permanent vs
transient correctly); or (b) add an enqueue-only entry point to
`StoreAndForwardService` that skips the immediate-delivery attempt, and have
`CachedCall` use it after its own first attempt. Approach (a) is cleaner and removes
the duplicated logic.
**Resolution**
_Unresolved._
### ExternalSystemGateway-004 — System retry settings are not honoured for cached calls/writes
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:114-115`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:86-87` |
**Description**
`CachedCallAsync` and `CachedWriteAsync` pass the definition's `MaxRetries` /
`RetryDelay` to `EnqueueAsync` only when they are non-default
(`MaxRetries > 0 ? ... : null`, `RetryDelay > TimeSpan.Zero ? ... : null`), otherwise
falling back to the S&F defaults. The site-side repository that supplies these
definitions, `SiteExternalSystemRepository.MapExternalSystem`
(`src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs:194`), never
reads `MaxRetries`/`RetryDelay` from SQLite at all — the constructed entities always
have `MaxRetries == 0` and `RetryDelay == TimeSpan.Zero`. As a result, at sites the
per-system retry settings the design doc requires are *always* discarded and the
global S&F defaults are silently used instead. The `> 0` guard in the ESG also makes
a legitimately-configured `MaxRetries` of 0 ("never retry") indistinguishable from
"unset", so an operator cannot express "do not retry".
**Recommendation**
Within this module, drop the `> 0` / `> Zero` guards and pass the definition values
through directly (or use nullable fields on the entity to distinguish "unset"). The
companion fix in `SiteExternalSystemRepository` to actually map the retry columns
should be tracked against the SiteRuntime module.
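
A sketch of the nullable-field option, assuming a hypothetical `RetrySettings` shape rather than the real entity: `null` means "unset" and falls back to the S&F default, while an explicit `0` is passed through as "never retry".

```csharp
using System;

// Hypothetical shape: null means "unset", 0 means "never retry".
public sealed record RetrySettings(int? MaxRetries, TimeSpan? RetryDelay);

public static class RetryResolver
{
    // Falls back to the supplied defaults only for values that are genuinely unset.
    public static (int MaxRetries, TimeSpan RetryDelay) Resolve(
        RetrySettings perSystem, int defaultMaxRetries, TimeSpan defaultDelay)
        => (perSystem.MaxRetries ?? defaultMaxRetries,
            perSystem.RetryDelay ?? defaultDelay);
}
```
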
**Resolution**
_Unresolved._
### ExternalSystemGateway-005 — `HttpRequestMessage` and `HttpResponseMessage` are not disposed
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:133-167` |
**Description**
`InvokeHttpAsync` creates an `HttpRequestMessage` (line 133) and receives an
`HttpResponseMessage` from `SendAsync` (line 155); neither is wrapped in a `using` nor
explicitly disposed. Both are `IDisposable` and own resources (the request's
`StringContent`, the response's content stream). Under the per-invocation call volume
of a busy site this produces avoidable pressure on the finalizer queue and can hold
socket/stream resources longer than necessary. The success path reads the content but
never disposes the response; the error path likewise reads `errorBody` and then throws
without disposing.
**Recommendation**
Wrap the request in `using var request = ...` and the response in
`using var response = ...` (or call `Dispose()` in a `finally`). Ensure disposal still
occurs on the exception paths.
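
The disposal pattern could look like the sketch below. `DisposingInvoker.SendAndReadAsync` is a hypothetical stand-in for `InvokeHttpAsync`; only the `using` declarations are the point.

```csharp
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class DisposingInvoker
{
    // Hypothetical stand-in for InvokeHttpAsync; only the disposal pattern matters here.
    public static async Task<string> SendAndReadAsync(
        HttpClient client, HttpMethod method, string url, CancellationToken ct)
    {
        using var request = new HttpRequestMessage(method, url);
        using var response = await client.SendAsync(request, ct).ConfigureAwait(false);

        var body = await response.Content.ReadAsStringAsync(ct).ConfigureAwait(false);
        if (!response.IsSuccessStatusCode)
        {
            // Throwing here still disposes both objects when the scope unwinds.
            throw new HttpRequestException($"HTTP {(int)response.StatusCode}: {body}");
        }
        return body;
    }
}
```

Because the body is read into a string before the method returns, nothing the caller receives depends on the disposed response.
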
**Resolution**
_Unresolved._
### ExternalSystemGateway-006 — `BuildUrl` ignores path templates and appends a trailing slash for empty paths
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:180-196` |
**Description**
`BuildUrl` does `baseUrl.TrimEnd('/') + "/" + path.TrimStart('/')`. When `method.Path`
is empty (a method that targets the base URL itself), this still appends a `/`,
producing `https://host/api/` which some servers treat as a different resource than
`https://host/api`. More importantly, the design doc shows method paths as templates
like `/recipes/{id}`, but `BuildUrl` performs no placeholder substitution — a `{id}`
token is sent literally in the URL and the corresponding parameter is instead appended
as a query-string entry (for GET/DELETE) or placed in the JSON body (POST/PUT). Either
the design's path-template feature is unimplemented, or the doc is stale; in the
current code a method defined as `/recipes/{id}` will never produce a correct URL.
**Recommendation**
Decide whether path templating is in scope. If yes, implement `{name}` substitution
from `parameters` in `BuildUrl` and exclude substituted parameters from the query
string/body. If no, update the component design doc to remove the `/recipes/{id}`
example and state that paths are literal. Also avoid appending a trailing `/` when
`path` is empty.
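
If path templating is kept in scope, the substitution could be sketched as follows. `UrlTemplate.Build` is a hypothetical replacement for `BuildUrl`, and it assumes parameters arrive as string name/value pairs, which may not match the real signature.

```csharp
using System;
using System.Collections.Generic;

public static class UrlTemplate
{
    // Hypothetical replacement for BuildUrl. Returns the URL plus the parameters
    // that were NOT consumed by the path, for the caller to place in query/body.
    public static (string Url, Dictionary<string, string> Remaining) Build(
        string baseUrl, string path, IReadOnlyDictionary<string, string> parameters)
    {
        var remaining = new Dictionary<string, string>(parameters);
        var resolved = path ?? string.Empty;

        foreach (var (name, value) in parameters)
        {
            var token = "{" + name + "}";
            if (resolved.Contains(token, StringComparison.Ordinal))
            {
                // Escape the value so it cannot inject extra path segments.
                resolved = resolved.Replace(
                    token, Uri.EscapeDataString(value), StringComparison.Ordinal);
                remaining.Remove(name);
            }
        }

        var trimmedBase = baseUrl.TrimEnd('/');
        var trimmedPath = resolved.TrimStart('/');
        // No trailing slash when the method targets the base URL itself.
        var url = trimmedPath.Length == 0 ? trimmedBase : trimmedBase + "/" + trimmedPath;
        return (url, remaining);
    }
}
```
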
**Resolution**
_Unresolved._
### ExternalSystemGateway-007 — External error response bodies are echoed verbatim into script-visible error messages
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:167-177` |
**Description**
On a non-success HTTP response, the full response body is read into `errorBody` and
embedded verbatim into the exception message (`$"HTTP {code} from {name}: {errorBody}"`),
which then flows into `ExternalCallResult.ErrorMessage` and back to the calling script,
and into Site Event Logging. An external system error page can be arbitrarily large
(an HTML stack trace, a multi-megabyte body) and may contain sensitive detail. There
is no size cap, so a hostile or misbehaving endpoint can inflate every error log entry
and error string returned to scripts. There is also no content-type check before
treating the body as text.
**Recommendation**
Truncate `errorBody` to a bounded length (e.g. 12 KB) before embedding it, and
consider logging the full body separately at debug level rather than returning it to
the script. Optionally only include the body when the content type is textual.
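
A minimal sketch of the cap, assuming a hypothetical `ErrorBodyFormatter.Truncate` helper. It limits characters rather than bytes, which is only an approximation of a byte budget.

```csharp
public static class ErrorBodyFormatter
{
    // Hypothetical helper; maxChars is a character cap, an approximation of a byte budget.
    public static string Truncate(string body, int maxChars)
    {
        if (string.IsNullOrEmpty(body)) return string.Empty;
        if (body.Length <= maxChars) return body;
        // Keep the head of the body and record how much was dropped.
        return body.Substring(0, maxChars) + $"... [truncated, {body.Length} chars total]";
    }
}
```
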
**Resolution**
_Unresolved._
### ExternalSystemGateway-008 — Cancellation is conflated with transient timeout failure
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ErrorClassifier.cs:24-30`, `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:157-159` |
**Description**
`ErrorClassifier.IsTransient(Exception)` returns `true` for `TaskCanceledException`
and `OperationCanceledException`. `HttpClient.SendAsync` throws `TaskCanceledException`
both when its internal timeout elapses *and* when the supplied `CancellationToken` is
cancelled (e.g. the Script Execution Actor is stopped, or the actor system is shutting
down). Because `InvokeHttpAsync`'s `catch` filter treats all of these as transient, a
caller-initiated cancellation during a `CachedCall` will be misclassified as a
transient failure and the message will be buffered for retry — work the caller
explicitly asked to abandon. For a `Call`, a shutdown-time cancellation is reported to
the script as a "Transient error" rather than an `OperationCanceledException`.
**Recommendation**
In `InvokeHttpAsync`, check `cancellationToken.IsCancellationRequested` first and
rethrow `OperationCanceledException` (or let it propagate) before applying transient
classification. Only treat a cancellation as a timeout when the supplied token is
*not* the one that was cancelled.
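
The ordering of the checks can be sketched as below. `FailureKind` and `CancellationAwareClassifier` are hypothetical names, and the final fallback to `Permanent` is an assumption rather than the behaviour of `ErrorClassifier`.

```csharp
using System;
using System.Net.Http;
using System.Threading;

public enum FailureKind { Cancelled, Transient, Permanent }

public static class CancellationAwareClassifier
{
    // Hypothetical classifier: the caller's token is checked before any transient rule.
    public static FailureKind Classify(Exception ex, CancellationToken callerToken)
    {
        // TaskCanceledException derives from OperationCanceledException.
        if (ex is OperationCanceledException)
        {
            return callerToken.IsCancellationRequested
                ? FailureKind.Cancelled   // the caller asked to abandon the work
                : FailureKind.Transient;  // otherwise treated as a timeout
        }

        return ex is HttpRequestException or TimeoutException
            ? FailureKind.Transient
            : FailureKind.Permanent;      // assumed default for everything else
    }
}
```
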
**Resolution**
_Unresolved._
### ExternalSystemGateway-009 — `StoreAndForwardResult` from `EnqueueAsync` is discarded; permanent failures during buffering are swallowed
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:109-117` |
**Description**
`CachedCallAsync` assigns the result of `_storeAndForward.EnqueueAsync(...)` to
`sfResult` and then never reads it — it unconditionally returns
`new ExternalCallResult(true, null, null, WasBuffered: true)`. `EnqueueAsync` can
return `Success == false` (a permanent failure encountered during its
immediate-delivery attempt — `StoreAndForwardService.cs:142`) or `Buffered == false`
(delivered immediately). In both cases the ESG still reports the call as buffered and
successful to the script. A permanent failure surfaced by the S&F immediate attempt is
therefore silently lost instead of being returned to the script as the design requires
("On permanent failure (HTTP 4xx), the error is returned synchronously").
**Recommendation**
Inspect `sfResult`: if `Success == false` return an error `ExternalCallResult`; set
`WasBuffered` from `sfResult.Buffered` rather than hard-coding `true`. (This finding is
partly subsumed by the dispatch redesign in finding 003.)
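
The mapping might look like the following sketch. Both record shapes are hypothetical, modelled on the member names quoted in this finding; in particular the `Error` property on `StoreAndForwardResult` is an assumption.

```csharp
#nullable enable

// Hypothetical shapes modelled on the names quoted above; the real types may differ.
public sealed record StoreAndForwardResult(bool Success, bool Buffered, string? Error);
public sealed record ExternalCallResult(
    bool Success, string? Response, string? ErrorMessage, bool WasBuffered);

public static class CachedCallMapper
{
    // Surfaces a permanent failure instead of always reporting "buffered and successful".
    public static ExternalCallResult FromEnqueue(StoreAndForwardResult sfResult)
        => sfResult.Success
            ? new ExternalCallResult(true, null, null, WasBuffered: sfResult.Buffered)
            : new ExternalCallResult(
                false, null, sfResult.Error ?? "Permanent failure during delivery",
                WasBuffered: false);
}
```
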
**Resolution**
_Unresolved._
### ExternalSystemGateway-010 — `GetConnectionAsync` leaks the `SqlConnection` when `OpenAsync` fails
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:48-50` |
**Description**
`GetConnectionAsync` constructs `new SqlConnection(...)` and calls `await
connection.OpenAsync(...)`. If `OpenAsync` throws (unreachable server, bad
credentials, cancellation) the just-created `SqlConnection` instance is never disposed
— the exception propagates and the local reference is lost. While an unopened
`SqlConnection` is lightweight, over many failing calls this is an avoidable leak. The
design doc says `Database.Connection()` failures return an error to the script; the
current code lets a raw `SqlException` escape, which is acceptable, but the leak is
not.
**Recommendation**
Wrap the open in a try/catch that disposes the connection before rethrowing:
`try { await connection.OpenAsync(ct); } catch { connection.Dispose(); throw; }`.
**Resolution**
_Unresolved._
### ExternalSystemGateway-011 — Every call performs a full repository scan of all systems and methods
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:231-245`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:90-97` |
**Description**
`ResolveSystemAndMethodAsync` calls `GetAllExternalSystemsAsync()` and then
`GetMethodsByExternalSystemIdAsync()` and filters in memory on every single call;
`ResolveConnectionAsync` calls `GetAllDatabaseConnectionsAsync()` and filters in memory
on every cached write / connection request. At sites this hits the SQLite repository,
and `SiteExternalSystemRepository` re-reads and re-parses the methods JSON each time.
For a hot script path this is unnecessary repeated I/O and allocation. Definitions only
change on deployment, so they are eminently cacheable.
**Recommendation**
Add an in-memory cache of system/method/connection definitions keyed by name,
invalidated on artifact deployment. Alternatively use a name-keyed repository lookup
rather than fetch-all-then-filter.
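
A name-keyed cache could be sketched as follows. `DefinitionCache<TDefinition>` is a hypothetical type, and ordinal (case-sensitive) name matching is an assumption about how definitions are looked up.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical cache; TDefinition stands in for a system, method, or connection definition.
public sealed class DefinitionCache<TDefinition> where TDefinition : class
{
    private readonly ConcurrentDictionary<string, Task<TDefinition>> _entries =
        new(StringComparer.Ordinal);

    // Returns the cached load for `name`, starting one if none is stored yet.
    // Under contention the loader can run more than once, but only one task is kept.
    public Task<TDefinition> GetOrLoadAsync(string name, Func<string, Task<TDefinition>> loader)
        => _entries.GetOrAdd(name, loader);

    // Call from the artifact-deployment path so the next lookup re-reads the repository.
    public void Invalidate() => _entries.Clear();
}
```

One caveat of this shape is that a faulted load stays cached until `Invalidate` is called, so a production version would likely evict failed entries.
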
**Resolution**
_Unresolved._
### ExternalSystemGateway-012 — Permanent-failure logging requirement is not met; `_logger` is injected but unused
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:24,169-177`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:22` |
**Description**
The design doc states permanent failures are "Logged to Site Event Logging", but
`InvokeHttpAsync` performs no logging on the permanent-failure path. In fact the
injected `ILogger<ExternalSystemClient>` and `ILogger<DatabaseGateway>` fields are
never used at all in either class. Either the logging is expected to happen in the
caller (Script Execution Actor) — in which case the design doc is imprecise about
where — or it is missing. Separately, `IsTransient(HttpStatusCode)` treats any
non-success, non-(5xx/408/429) status as permanent without an explicit comment, which
is a reasonable default but undocumented.
**Recommendation**
Add a `_logger.LogWarning` on the permanent-failure path (and a debug log on
transient), or clarify in the design doc that Site Event Logging capture is the
caller's responsibility and remove the unused `_logger` fields. Add a comment in
`ErrorClassifier` documenting the "default to permanent" behaviour.
**Resolution**
_Unresolved._
### ExternalSystemGateway-013 — `MaxConcurrentConnectionsPerSystem` and `DefaultHttpTimeout` options are defined but never used
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemGatewayOptions.cs:9,12`, `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:13` |
**Description**
`ExternalSystemGatewayOptions.MaxConcurrentConnectionsPerSystem` (default 10) and
`DefaultHttpTimeout` (default 30s) are bound from configuration but neither is read
anywhere. `AddHttpClient()` registers the default factory with no
`ConfigurePrimaryHttpMessageHandler`/`SocketsHttpHandler` `MaxConnectionsPerServer` and
no `Timeout`, so both options have no effect. An operator setting these values gets
them silently ignored — a misleading configuration surface (`DefaultHttpTimeout` is
also referenced by finding 002).
**Recommendation**
Either wire the options into a named/typed `HttpClient` registration (set
`MaxConnectionsPerServer` on the primary handler, set `Timeout`), or remove the unused
options to avoid implying behaviour that does not exist.
**Resolution**
_Unresolved._
### ExternalSystemGateway-014 — Cached-call buffering path and `DatabaseGateway` are untested
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.ExternalSystemGateway.Tests/ExternalSystemClientTests.cs:1`, (no `DatabaseGatewayTests.cs`) |
**Description**
`ExternalSystemClientTests` covers system/method not-found, success, transient 500 and
permanent 400 for `CallAsync`, plus `CachedCall` not-found and success. It does **not**
cover: the `CachedCall` transient-failure → S&F buffering branch (the most
behaviour-rich path, including the `_storeAndForward == null` fallback and `WasBuffered`
semantics), the `CachedCall` permanent-failure branch, connection-exception
classification (`HttpRequestException` thrown by the handler), `BuildUrl` query-string
construction, and `ApplyAuth` for the apikey/basic variants. There is **no test file
for `DatabaseGateway`** at all — `GetConnectionAsync` not-found, `CachedWriteAsync`
not-found, and the `_storeAndForward == null` guard are entirely uncovered. The
`MockHttpMessageHandler` also does not assert request URL/headers/body, so auth and
URL construction are unverified.
**Recommendation**
Add tests for the `CachedCall` transient/buffering paths (with a substituted S&F
service), `DatabaseGateway` not-found and null-S&F guards, and `BuildUrl`/`ApplyAuth`
by asserting on the captured `HttpRequestMessage` in the mock handler.
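
A capturing handler along these lines would let tests assert on what was actually sent. `CapturingHandler` is a hypothetical test double, not the existing `MockHttpMessageHandler`.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical test double; it records each request so tests can assert on URL, auth, and body.
public sealed class CapturingHandler : HttpMessageHandler
{
    private readonly HttpStatusCode _status;

    public List<(HttpMethod Method, string Url, string? Authorization, string? Body)> Requests { get; }
        = new();

    public CapturingHandler(HttpStatusCode status = HttpStatusCode.OK) => _status = status;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Read the body now: the caller may dispose the request after SendAsync returns.
        var body = request.Content is null
            ? null
            : await request.Content.ReadAsStringAsync(cancellationToken).ConfigureAwait(false);

        Requests.Add((request.Method,
                      request.RequestUri?.AbsoluteUri ?? string.Empty,
                      request.Headers.Authorization?.ToString(),
                      body));

        return new HttpResponseMessage(_status) { Content = new StringContent("{}") };
    }
}
```
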
**Resolution**
_Unresolved._

@@ -0,0 +1,420 @@
# Code Review — HealthMonitoring
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.HealthMonitoring` |
| Design doc | `docs/requirements/Component-HealthMonitoring.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 12 |
## Summary
The HealthMonitoring module is small, readable, and broadly faithful to the design
intent: per-interval error counters with atomic read-and-reset, monotonic sequence
numbers with Unix-ms seeding to survive failover, sequence-guarded staleness
rejection, and a 60s offline timeout. However, the review surfaced two recurring
themes. First, **a documented metric is silently unimplemented** — store-and-forward
buffer depths are never populated (`SetStoreAndForwardDepths` has zero callers and a
test asserts the field is always empty), so the dashboard cannot show the buffer
depth metric the design doc requires. Second, **the central aggregator's in-memory
state model has unguarded shared mutable state**: `SiteHealthState` is a mutable
class whose fields are written by a background timer thread, by `ProcessReport`, and
by `MarkHeartbeat` with no synchronization, and the same live mutable objects are
handed straight to UI callers via `GetAllSiteStates`. The `ProcessReport` logic also
mutates shared state inside a `ConcurrentDictionary.AddOrUpdate` update delegate,
which the runtime may invoke more than once under contention. Further gaps include
central self-report offline detection, dropped heartbeats for not-yet-registered
sites, and missing test coverage for the central report loop,
heartbeat path, and most collector setters. None of the findings are crash-class,
but the concurrency issues are Medium/High and the missing S&F metric is a real
design-adherence gap.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). |
| 2 | Akka.NET conventions | x | Module itself contains no actors (transport abstracted via `IHealthReportTransport`); `AddHealthMonitoringActors` is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope. |
| 3 | Concurrency & thread safety | x | Unguarded mutable `SiteHealthState` (HealthMonitoring-002); mutation inside `AddOrUpdate` delegate (HealthMonitoring-003); `GetAllSiteStates` leaks live mutable references (HealthMonitoring-008). Collector counters correctly use `Interlocked`. |
| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010); top-level loop error handling is sound. |
| 5 | Security | x | No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface. |
| 6 | Performance & resource management | x | `PeriodicTimer` instances correctly disposed via `using`. Dictionary snapshots per report are acceptable at the documented scale. No issues found. |
| 7 | Design-document adherence | x | Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006). |
| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011). |
| 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012). |
## Findings
### HealthMonitoring-001 — Store-and-forward buffer depth metric is never populated
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:104`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:79` |
**Description**
`Component-HealthMonitoring.md` lists "Store-and-forward buffer depth" (pending
messages by category) as a required monitored metric. `SiteHealthCollector` exposes
`SetStoreAndForwardDepths(...)` to receive it, but a codebase-wide search shows the
method has **no callers**, so `_sfBufferDepths` always remains the empty dictionary it
is initialized to. `HealthReportSender` queries `GetParkedMessageCountAsync()` and
sets `ParkedMessageCount`, but parked count is a distinct metric from per-category
buffer depth. The test `SiteHealthCollectorTests.StoreAndForwardBufferDepths_IsEmptyPlaceholder`
even codifies the unimplemented state as expected behaviour. The result is that the
central dashboard cannot display buffer depth, a documented triage metric.
**Recommendation**
Wire `SetStoreAndForwardDepths` into `HealthReportSender.ExecuteAsync` (alongside the
existing parked-count call) using the S&F engine's per-category depth API, or, if the
metric is intentionally deferred, record that decision in the design doc and remove
the dead setter. Update the placeholder test accordingly once implemented.
**Resolution**
_Unresolved._
### HealthMonitoring-002 — `SiteHealthState` mutable fields written from multiple threads without synchronization
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:137` |
**Description**
`SiteHealthState` is a plain mutable class. Its fields (`LatestReport`,
`LastReportReceivedAt`, `LastHeartbeatAt`, `LastSequenceNumber`, `IsOnline`) are
mutated from at least three concurrent contexts: `ProcessReport` (caller thread —
ClusterClient/PubSub message handlers), `MarkHeartbeat` (caller thread — heartbeat
handler), and `CheckForOfflineSites` (the `BackgroundService` timer thread). The
`ConcurrentDictionary` only protects the dictionary structure, not the objects it
stores. A heartbeat update and the offline-check can interleave on the same
`SiteHealthState` instance, and reads/writes of `DateTimeOffset` (a 16-byte struct)
and `long` fields are not guaranteed atomic on all platforms — producing torn reads
and lost updates of `IsOnline`/`LastHeartbeatAt`.
**Recommendation**
Make state transitions atomic: either guard all reads/writes of a `SiteHealthState`
with a per-site lock, or replace `SiteHealthState` with an immutable record updated
via `ConcurrentDictionary` compare-and-swap (`TryUpdate`) so every transition is
a single atomic reference swap.
**Resolution**
_Unresolved._
### HealthMonitoring-003 — Shared state mutated inside `ConcurrentDictionary.AddOrUpdate` update delegate
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:55-78` |
**Description**
The update delegate passed to `AddOrUpdate` mutates the `existing` object in place
(`existing.LatestReport = report; existing.IsOnline = true; ...`). `AddOrUpdate`'s
contract explicitly allows the update delegate to be invoked **more than once** under
contention (when the CAS that installs the result loses a race and is retried). Each
invocation mutates the shared object, so a concurrent report for the same site can
observe a half-applied update, and the multi-field assignment is not atomic with
respect to readers in `GetAllSiteStates`/`CheckForOfflineSites`. The intended
"only replace if sequence is higher" guard can also be subverted because the
sequence comparison and the field writes are not a single atomic step.
**Recommendation**
Have the update delegate return a **new** `SiteHealthState` (record `with` copy)
rather than mutating `existing`, and treat the dictionary value as immutable.
Combined with HealthMonitoring-002, this makes every state transition an atomic
reference swap with no observable intermediate state.
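
A minimal sketch of that shape, assuming a simplified state type. `SiteHealthSnapshot` and `SnapshotAggregator` are hypothetical names, and the real `SiteHealthState` carries more fields (such as `LatestReport`) than shown here:

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical immutable stand-in for SiteHealthState (field set reduced for brevity).
public sealed record SiteHealthSnapshot(
    string SiteId,
    long LastSequenceNumber,
    DateTimeOffset LastReportReceivedAt,
    DateTimeOffset LastHeartbeatAt,
    bool IsOnline);

public sealed class SnapshotAggregator
{
    private readonly ConcurrentDictionary<string, SiteHealthSnapshot> _siteStates = new();

    // Returns the state stored for the site after this call.
    public SiteHealthSnapshot ProcessReport(string siteId, long sequenceNumber, DateTimeOffset now)
    {
        return _siteStates.AddOrUpdate(
            siteId,
            // Add path: first report seen for this site.
            id => new SiteHealthSnapshot(id, sequenceNumber, now, now, IsOnline: true),
            // Update path: may be invoked more than once under contention, so it is
            // side-effect free and returns a fresh value instead of mutating `existing`.
            (_, existing) => sequenceNumber > existing.LastSequenceNumber
                ? existing with
                {
                    LastSequenceNumber = sequenceNumber,
                    LastReportReceivedAt = now,
                    LastHeartbeatAt = now,
                    IsOnline = true
                }
                : existing);
    }

    // Handing out the reference is safe because the record is never mutated.
    public SiteHealthSnapshot? GetSiteState(string siteId) =>
        _siteStates.TryGetValue(siteId, out var state) ? state : null;
}
```

Because the sequence comparison and the construction of the new value happen inside one delegate whose result is installed atomically, a stale report leaves the stored snapshot untouched and readers never see a half-applied update.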
**Resolution**
_Unresolved._
### HealthMonitoring-004 — Inconsistent heartbeat interval described across XML docs
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:146-148`, `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:21`, `src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs:16` |
**Description**
The heartbeat cadence that offline detection relies on is documented inconsistently.
`CheckForOfflineSites` says "heartbeats arrive every ~5s"; `SiteHealthState.LastHeartbeatAt`
says "~5s heartbeat"; but `ICentralHealthAggregator.MarkHeartbeat` says "~2s
heartbeats are arriving". The actual cadence is set elsewhere (Cluster Infrastructure /
`SiteCommunicationActor`). Readers cannot reason about whether a 60s offline timeout
gives the intended grace without a single authoritative number.
**Recommendation**
Pick the correct interval (verify against the heartbeat scheduler in
`SiteCommunicationActor`/Cluster Infrastructure) and use it consistently in all three
comments, ideally referencing the owning component rather than restating a magic number.
**Resolution**
_Unresolved._
### HealthMonitoring-005 — Central self-report site can flap offline; no heartbeat grace like real sites
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:48-81`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:149` |
**Description**
`CheckForOfflineSites` decides offline status purely from `LastHeartbeatAt`, and for
real sites that field is kept fresh by frequent (~2-5s) heartbeats so the 60s timeout
only fires on genuine total loss. The synthetic `central` site, however, has no
heartbeat source — `LastHeartbeatAt` is only bumped by `ProcessReport` from the
30s `CentralHealthReportLoop`. The loop also only runs on the cluster leader and
silently skips a cycle on any exception. Consequently, a single skipped/late central
self-report (leader GC pause, brief stall, mid-failover before the new leader's loop
spins up) leaves `central` with no signal for >60s and it is marked offline even
though the central cluster is healthy. The central card thus has no equivalent of
the "one missed report grace" the design doc grants real sites.
**Recommendation**
Either feed `central` a heartbeat equivalent (e.g. have `MarkHeartbeat` called for
`CentralSiteId` on a fast timer independent of the leader-only report loop), or apply
a longer/distinct offline timeout to the `central` keyspace entry, and ensure the new
leader starts the report loop promptly on failover.
**Resolution**
_Unresolved._
### HealthMonitoring-006 — Sequence seeding contradicts the doc's "starting at 1" wording and is untestable
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:28`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:32` |
**Description**
The `HealthReportSender` class XML summary states "Sequence numbers are monotonic,
starting at 1, and reset on service restart." The implementation instead seeds
`_sequenceNumber` with `DateTimeOffset.UtcNow.ToUnixTimeMilliseconds()` so the first
emitted sequence is a large epoch value, specifically to keep ordering correct across
failover. The summary is therefore stale and contradicts the code. Separately, the
seed reads `DateTimeOffset.UtcNow` directly at field initialization rather than
through an injected `TimeProvider` (which `CentralHealthAggregator` already uses),
making the seeding logic impossible to unit-test deterministically and dependent on
node wall-clock agreement — if one node's clock lags, its post-failover reports can
be silently rejected as stale by the aggregator.
**Recommendation**
Fix the `HealthReportSender` XML summary to describe the actual Unix-ms seeding
strategy, and inject `TimeProvider` for the seed so the behaviour is testable and the
clock dependency is explicit.
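
One possible way to make the seed injectable, assuming .NET 8's `System.TimeProvider`. `SequenceGenerator` and `FixedTimeProvider` are hypothetical names, and whether the first emitted value should be the seed itself or seed + 1 is an assumption to check against the real sender:

```csharp
using System;
using System.Threading;

// Hypothetical helper; in the reviewed code the seed is a field initializer.
public sealed class SequenceGenerator
{
    private long _sequenceNumber;

    public SequenceGenerator(TimeProvider timeProvider)
    {
        // Seed from the injected clock (Unix milliseconds) rather than reading
        // DateTimeOffset.UtcNow directly, so tests can control the value.
        _sequenceNumber = timeProvider.GetUtcNow().ToUnixTimeMilliseconds();
    }

    // Assumption: the first emitted sequence is seed + 1.
    public long Next() => Interlocked.Increment(ref _sequenceNumber);
}

// Test double that always reports the same instant.
public sealed class FixedTimeProvider : TimeProvider
{
    private readonly DateTimeOffset _now;

    public FixedTimeProvider(DateTimeOffset now) => _now = now;

    public override DateTimeOffset GetUtcNow() => _now;
}
```

Production code would pass `TimeProvider.System`; a unit test can pass a fixed provider and assert the exact sequence values.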
**Resolution**
_Unresolved._
### HealthMonitoring-007 — Heartbeats for not-yet-registered sites are silently dropped
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86-99` |
**Description**
`MarkHeartbeat` returns immediately if the site is not already in `_siteStates`
("registration only happens on report"). Central health state is in-memory only and
not persisted. After a central restart or failover the aggregator starts empty, so
for up to one full report interval (default 30s) every site emits only heartbeats
that are all discarded — the site is reported as *unknown* (absent from
`GetAllSiteStates`) rather than *online*, even though heartbeats prove it is
reachable. This is a visible dashboard regression precisely during the failover
window, which is when operators most need accurate status.
**Recommendation**
Allow `MarkHeartbeat` to register a minimal `SiteHealthState` (online, no
`LatestReport` yet, with a UI-visible "awaiting first report" indication) when a
heartbeat arrives for an unknown site, so reachable sites show online immediately
after a central restart.
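
A rough sketch of heartbeat-driven registration, under the assumption that the state type is immutable and `LatestReport` is nullable. `HeartbeatAwareState` and `HeartbeatRegistry` are hypothetical names, and the `string?` report is only a stand-in for `SiteHealthReport`:

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical simplified state; LatestReport is a string stand-in for SiteHealthReport.
public sealed record HeartbeatAwareState(
    string SiteId,
    DateTimeOffset LastHeartbeatAt,
    bool IsOnline,
    string? LatestReport)
{
    // True until the first full report arrives, so the UI can label the site card.
    public bool AwaitingFirstReport => LatestReport is null;
}

public sealed class HeartbeatRegistry
{
    private readonly ConcurrentDictionary<string, HeartbeatAwareState> _siteStates = new();

    public HeartbeatAwareState MarkHeartbeat(string siteId, DateTimeOffset now) =>
        _siteStates.AddOrUpdate(
            siteId,
            // Unknown site: register it as online with no report yet.
            id => new HeartbeatAwareState(id, now, IsOnline: true, LatestReport: null),
            // Known site: refresh the heartbeat and keep any existing report.
            (_, existing) => existing with { LastHeartbeatAt = now, IsOnline = true });
}
```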
**Resolution**
_Unresolved._
### HealthMonitoring-008 — `GetAllSiteStates` / `GetSiteState` leak live mutable state objects to callers
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:104-116` |
**Description**
`GetAllSiteStates` copies the dictionary but the copy still holds references to the
same live mutable `SiteHealthState` instances; `GetSiteState` returns the live
instance directly. UI consumers (Blazor Server / SignalR circuits) read these objects
on their own threads while the aggregator's background timer and report handlers
concurrently mutate the very same instances (see HealthMonitoring-002). A UI render
can observe a `SiteHealthState` with, e.g., `IsOnline == true` but a `LatestReport`
from a different update, or a torn `DateTimeOffset`. Callers could also mutate the
shared state, corrupting aggregator state.
**Recommendation**
Return immutable snapshots: convert `SiteHealthState` to a record (per
HealthMonitoring-002/003) so handing out the reference is safe, or deep-copy each
state into an immutable DTO before returning.
**Resolution**
_Unresolved._
### HealthMonitoring-009 — Missing test coverage for central report loop, heartbeat path, replication, and collector setters
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.HealthMonitoring.Tests/` |
**Description**
Several behaviours have no automated coverage:
- `CentralHealthReportLoop` — leader-only gating (`SelfIsPrimary`), self-report
generation, sequence assignment: no test file at all.
- `CentralHealthAggregator.MarkHeartbeat` — keeping a site online between reports,
online recovery via heartbeat, and the unknown-site drop behaviour
(HealthMonitoring-007): untested.
- Offline detection driven by `LastHeartbeatAt` vs `LastReportReceivedAt` — the
existing offline tests only advance time after a report, never exercising the
heartbeat-keeps-alive path the design depends on.
- `SiteHealthCollector` methods `SetClusterNodes`, `SetInstanceCounts`, `SetParkedMessageCount`,
`SetNodeHostname`, `SetActiveNode`/`NodeRole`, `UpdateTagQuality`,
`UpdateConnectionEndpoint`: no test asserts that they are reflected in the report.
- `SiteHealthReportReplica` idempotency under double delivery: untested.
**Recommendation**
Add tests for the central report loop (with a fake `IClusterNodeProvider`), the
heartbeat-keeps-online and unknown-site heartbeat paths, and the remaining collector
setters' presence in `CollectReport` output.
**Resolution**
_Unresolved._
### HealthMonitoring-010 — `HealthReportSender` silently swallows inner failures with bare `catch {}`
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:70-87` |
**Description**
The cluster-nodes update and parked-message-count query are each wrapped in
`try { ... } catch { /* Non-fatal */ }` with no logging. A persistent failure (e.g.
the S&F SQLite store is permanently broken, or `GetClusterNodes()` always throws)
is then completely invisible — every report silently ships with stale cluster nodes
and a parked count of 0, with nothing in the logs to explain the wrong dashboard
values. Bare `catch` with no exception variable also catches `OperationCanceledException`
and would mask shutdown signalling if the awaited call observed the token.
**Recommendation**
Catch a specific exception type (or at least `Exception ex`) and `LogWarning`/`LogDebug`
the failure so persistent degradation is diagnosable; avoid swallowing
`OperationCanceledException`.
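
A sketch of the pattern using an exception filter. The `query` and `logWarning` delegates are hypothetical stand-ins for the S&F store call and `ILogger.LogWarning`; neither name comes from the reviewed code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ParkedCountProbe
{
    // Returns the parked count, or null when the query failed and was logged.
    public static async Task<int?> TryGetParkedCountAsync(
        Func<CancellationToken, Task<int>> query,
        Action<Exception, string> logWarning,
        CancellationToken cancellationToken)
    {
        try
        {
            return await query(cancellationToken);
        }
        // The filter lets OperationCanceledException (and TaskCanceledException,
        // which derives from it) propagate so shutdown is not masked.
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            logWarning(ex, "Parked message count query failed; keeping the previous value.");
            return null;
        }
    }
}
```

Returning `null` on failure lets the caller distinguish "query failed" from a genuine count of 0, which the current code cannot do.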
**Resolution**
_Unresolved._
### HealthMonitoring-011 — `AddHealthMonitoringActors` is a dead no-op placeholder
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs:42-46` |
**Description**
`AddHealthMonitoringActors` does nothing but `return services` with a "Placeholder for
Phase 4+" comment. A public extension method that silently no-ops is a trap: a caller
who registers it will believe actor wiring is in place. No caller currently invokes it.
**Recommendation**
Remove the method until it has real behaviour, or throw `NotImplementedException` so
accidental use fails loudly. If the actor model for this component is genuinely
planned, track it in the design doc instead of a half-method.
**Resolution**
_Unresolved._
### HealthMonitoring-012 — `SiteHealthState.LatestReport` initialized to `null!`, misrepresenting the contract
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11` |
**Description**
`LatestReport` is declared `SiteHealthReport LatestReport { get; set; } = null!;`,
suppressing nullability. Today every code path that creates a `SiteHealthState` (only
`ProcessReport`) assigns `LatestReport`, so it is never actually null — but the
`null!` declaration tells readers and the compiler the opposite of the real
invariant. If HealthMonitoring-007 is addressed by registering state from a heartbeat
(no report yet), this becomes a live `NullReferenceException` risk for UI code that
dereferences `LatestReport`.
**Recommendation**
Either make `LatestReport` `required` (matching how it is genuinely always set today)
or make it properly nullable `SiteHealthReport?` and have consumers handle the
"registered, no report yet" case explicitly — consistent with whatever is decided
for HealthMonitoring-007.
**Resolution**
_Unresolved._

@@ -0,0 +1,396 @@
# Code Review — Host
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.Host` |
| Design doc | `docs/requirements/Component-Host.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 11 |
## Summary
The Host module is the composition root for the entire ScadaLink system: a single
binary whose behaviour (`Central` vs `Site`) is driven entirely by configuration. The
implementation is generally faithful to `Component-Host.md` — startup validation,
role-based registration, Serilog enrichment, Windows Service support, dead-letter
monitoring, CoordinatedShutdown, and gRPC hosting on site nodes are all present and
backed by a solid test suite (`tests/ScadaLink.Host.Tests`).
The most significant problem is the readiness endpoint: `/health/ready` runs **all**
registered health checks, including the leader-only `active-node` check, so a fully
operational *standby* central node permanently reports `503` on `/health/ready`,
directly contradicting REQ-HOST-4a, which defines readiness as cluster membership +
DB connectivity (not leadership). Several other findings concern configuration that
is validated-but-never-consumed (`MachineDataDb`), design-doc drift (Akka.Persistence
is required by REQ-HOST-6 but the system uses no persistent actors), an incorrect
seed-node entry in the shipped site config, blocking sync-over-async during startup,
and unguarded string interpolation when building HOCON. None are crash/data-loss
class, but the readiness bug is High because it breaks load-balancer behaviour with
no safe workaround.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `/health/ready` includes the leader-only check (Host-001); site seed-node config points at the gRPC port (Host-004). |
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. HOCON built by raw string interpolation (Host-006); `StartAsync` returns before actors are confirmed running (Host-009). |
| 3 | Concurrency & thread safety | ☑ | Blocking `GetAwaiter().GetResult()` on a hosted-service startup thread (Host-005). `DeadLetterMonitorActor` state is actor-confined — no issues. |
| 4 | Error handling & resilience | ☑ | Top-level try/catch logs fatal and rethrows. No retry around DB migration / readiness preconditions (Host-010). |
| 5 | Security | ☑ | Plaintext DB password, LDAP service-account password and dev JWT key checked into `appsettings.Central.json` (Host-003). |
| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. |
| 7 | Design-document adherence | ☑ | REQ-HOST-6 mandates Akka.Persistence config but none exists and no persistent actors exist — doc is stale (Host-002). REQ-HOST-4 GrpcPort-≠-RemotingPort rule not enforced (Host-007). |
| 8 | Code organization & conventions | ☑ | `MachineDataDb` validated/declared but never consumed (Host-008). `LoggingOptions.MinimumLevel` is dead (Host-011). |
| 9 | Testing coverage | ☑ | Strong suite; no test asserts `/health/ready` excludes `active-node`, which is why Host-001 slipped through (noted in Host-001). |
| 10 | Documentation & comments | ☑ | Comments are accurate. REQ-HOST-6 in the design doc is the main stale-doc item (Host-002). |
## Findings
### Host-001 — `/health/ready` includes the leader-only `active-node` check
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Host/Program.cs:135-145` |
**Description**
`/health/ready` is mapped with `MapHealthChecks("/health/ready", ...)` and **no
`Predicate`**, so it executes every registered check: `database`, `akka-cluster`
*and* `active-node`. `ActiveNodeHealthCheck` (`Health/ActiveNodeHealthCheck.cs:38`)
returns `Unhealthy` on any node that is not the cluster leader. As a result a
standby central node that is fully operational (cluster member `Up`, database
reachable) still returns `503` on `/health/ready`. This contradicts REQ-HOST-4a,
which defines readiness as cluster membership + DB connectivity + singletons —
explicitly *not* leadership. `/health/active` is the endpoint intended to report
leadership. A load balancer using `/health/ready` to decide whether a node may
serve traffic will permanently treat the standby as unready, defeating failover
readiness. No test covers this: `HealthCheckTests.HealthReady_Endpoint_ReturnsResponse`
only asserts a response is returned, not the standby semantics.
**Recommendation**
Add a `Predicate` to the `/health/ready` mapping that excludes the `active-node`
check, e.g. `Predicate = check => check.Name != "active-node"` (or tag the readiness
checks and filter by tag). Add a regression test asserting a non-leader node returns
`200` on `/health/ready`.
**Resolution**
_Unresolved._
### Host-002 — Akka.Persistence required by REQ-HOST-6 is not configured and not used
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:70-108` |
**Description**
REQ-HOST-6 states the Host "must configure the Akka.NET actor system using
Akka.Hosting with ... **Persistence**: Configured with the appropriate journal and
snapshot store (SQL for central, SQLite for site)." The HOCON built in
`AkkaHostedService.StartAsync` contains no `akka.persistence` section, no journal and
no snapshot-store plugin, and `ScadaLink.Host.csproj` references neither
`Akka.Persistence.Hosting` nor any persistence plugin (even though the design doc's
Dependencies list includes `Akka.Persistence.Hosting`). A repo-wide search finds **no** `PersistentActor` /
`ReceivePersistentActor` subclasses — the system deliberately uses custom SQLite
storage services instead. The code is internally consistent, but the design document
is stale: it mandates a subsystem that does not exist. This is a documented-vs-actual
drift that will mislead future maintainers and any audit against REQ-HOST-6.
**Recommendation**
Update `Component-Host.md` REQ-HOST-6 and the Dependencies list to remove the
Akka.Persistence requirement (or explicitly state persistence is provided by
component-owned SQLite storage, not Akka.Persistence). If persistence *is* intended,
add the plugin packages and HOCON. Either way, code and doc must agree.
**Resolution**
_Unresolved._
### Host-003 — Secrets committed in plaintext in `appsettings.Central.json`
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Host/appsettings.Central.json:20-31` |
**Description**
`appsettings.Central.json` contains real-looking secrets in plaintext, checked into
source control: SQL Server passwords in the `ConfigurationDb` / `MachineDataDb`
connection strings (`Password=ScadaLink_Dev1#`), an LDAP service-account password
(`LdapServiceAccountPassword: "password"`), and a JWT signing key
(`JwtSigningKey: "scadalink-dev-jwt-signing-key-..."`). Even though these are
intended as development defaults, shipping them in the default config invites them
being reused verbatim in production, and a committed JWT signing key allows anyone
with repo access to forge session tokens. `TrustServerCertificate=true` additionally
disables TLS validation for the SQL connection.
**Recommendation**
Move all secrets out of committed `appsettings*.json` into environment variables,
user-secrets, or a secret store. Keep only non-sensitive structural defaults in the
file and document the required environment variables. At minimum add a clear comment
that these values are dev-only and must be overridden, and rotate the JWT key per
environment.
**Resolution**
_Unresolved._
### Host-004 — Site seed-node list points at the gRPC port, not a remoting port
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Host/appsettings.Site.json:10-19` |
**Description**
The shipped site config sets `Node:RemotingPort = 8082` and `Node:GrpcPort = 8083`,
but `Cluster:SeedNodes` is `["akka.tcp://scadalink@localhost:8082",
"akka.tcp://scadalink@localhost:8083"]`. The second seed node targets `8083`, which
is the Kestrel HTTP/2 gRPC port — not an Akka remoting endpoint. A node attempting to
join via that seed will try to establish an Akka.Remote TCP association against the
gRPC listener and fail. `StartupValidator` only checks that ≥2 seed nodes exist
(`StartupValidator.cs:54-56`), so this misconfiguration passes validation silently.
For the single-node dev site it is harmless (the first seed succeeds), but it is an
incorrect example that will be copied into multi-node site configs.
**Recommendation**
Correct the site seed-node list to reference the two site nodes' *remoting* ports
(e.g. `8082` and `8084`), never the gRPC port. Consider extending `StartupValidator`
to reject a seed node whose port equals this node's `GrpcPort`.
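
The suggested validator extension could look roughly like this. `SeedNodeRules` is a hypothetical helper, and the port parsing assumes addresses of the form `akka.tcp://system@host:port` with no IPv6 literals or trailing path:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SeedNodeRules
{
    // Extracts the port from an address such as "akka.tcp://scadalink@localhost:8082".
    // Assumption: the port is whatever follows the last ':' in the string.
    public static int? TryGetPort(string seedNode)
    {
        var colon = seedNode.LastIndexOf(':');
        if (colon < 0)
        {
            return null;
        }

        return int.TryParse(seedNode.AsSpan(colon + 1), out var port) ? port : (int?)null;
    }

    // Returns the seed nodes whose port equals this node's gRPC port.
    public static IReadOnlyList<string> FindSeedsOnGrpcPort(
        IEnumerable<string> seedNodes, int grpcPort) =>
        seedNodes.Where(seed => TryGetPort(seed) == grpcPort).ToList();
}
```

Applied to the shipped site config, the second seed (`...:8083`) would be flagged because it matches `Node:GrpcPort`.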
**Resolution**
_Unresolved._
### Host-005 — Blocking sync-over-async (`GetAwaiter().GetResult()`) inside `StartAsync`
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:345` |
**Description**
`RegisterSiteActors` calls `storeAndForwardService.StartAsync().GetAwaiter().GetResult()`
synchronously, blocking inside the `IHostedService.StartAsync` path. `StartAsync` is
itself declared synchronous (returns `Task.CompletedTask`), so the work cannot be
awaited cleanly. Blocking on async work risks thread-pool starvation during startup
and, if the awaited operation captures a synchronization context, deadlock. Note that
`GetAwaiter().GetResult()` rethrows the original exception (unlike `.Result`/`.Wait()`,
which wrap it in `AggregateException`), so exception wrapping is not the concern here;
the blocking is.
**Recommendation**
Make `AkkaHostedService.StartAsync` genuinely `async` and `await
storeAndForwardService.StartAsync(cancellationToken)`. Propagate the
`CancellationToken` so that a cancelled host startup also cancels the store-and-forward start.
**Resolution**
_Unresolved._
### Host-006 — HOCON assembled by unescaped string interpolation
| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:70-108` |
**Description**
The Akka HOCON is built with an interpolated string that injects
`_nodeOptions.NodeHostname`, `_clusterOptions.SeedNodes`, the computed roles, and
`SplitBrainResolverStrategy` directly into the configuration text. Values are not
escaped. A hostname or seed-node string containing a quote, backslash, brace, or
comment sequence would corrupt the HOCON and produce a confusing parse error far from
the real cause; `SplitBrainResolverStrategy` is interpolated without quoting, so a
value with whitespace breaks the document. Building cluster configuration from raw
string concatenation is also harder to maintain than the typed Akka.Hosting builder
the design doc (REQ-HOST-6) actually calls for ("via Akka.Hosting").
**Recommendation**
Prefer the `Akka.Hosting` `AddAkka(...)` builder with strongly-typed `WithRemoting`,
`WithClustering`, and split-brain-resolver configuration instead of hand-built HOCON.
If HOCON must be retained, validate/escape interpolated values (especially hostname
and seed nodes) before substitution.
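
If the hand-built HOCON stays, a quoting helper along these lines could guard the interpolated values. `HoconText.Quote` is a hypothetical name, and the sketch assumes HOCON quoted strings follow JSON string escaping, which should be verified against the Akka.NET HOCON parser in use:

```csharp
using System.Text;

public static class HoconText
{
    // Wraps a value in double quotes and escapes characters that would otherwise
    // terminate or alter a quoted string (assumed to follow JSON escape rules).
    public static string Quote(string value)
    {
        var sb = new StringBuilder(value.Length + 2);
        sb.Append('"');
        foreach (var c in value)
        {
            switch (c)
            {
                case '\\': sb.Append("\\\\"); break;
                case '"': sb.Append("\\\""); break;
                case '\n': sb.Append("\\n"); break;
                case '\r': sb.Append("\\r"); break;
                case '\t': sb.Append("\\t"); break;
                default: sb.Append(c); break;
            }
        }

        sb.Append('"');
        return sb.ToString();
    }
}
```

Interpolating `{HoconText.Quote(hostname)}` instead of `"{hostname}"` means a stray quote or backslash in configuration stays inside the string instead of corrupting the surrounding document.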
**Resolution**
_Unresolved._
### Host-007 — REQ-HOST-4 rule "GrpcPort ≠ RemotingPort" is not enforced
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Host/StartupValidator.cs:43-47` |
**Description**
REQ-HOST-4 requires: "Site nodes must have `GrpcPort` in valid port range (1–65535)
**and different from `RemotingPort`**." `StartupValidator` validates the GrpcPort
range but never compares it to `RemotingPort`. A site config that sets both ports to
the same value passes validation and then fails opaquely at runtime when Kestrel and
Akka.Remote both try to bind the port. The GrpcPort range check is also skipped
entirely when the key is absent (`grpcPortStr != null`), relying on the
`NodeOptions` default of 8083 — acceptable, but the equality rule is the missing
piece.
**Recommendation**
Add a check in the `role == "Site"` block: if `GrpcPort` (resolved, including the
8083 default) equals `RemotingPort`, add an error
`"ScadaLink:Node:GrpcPort must differ from RemotingPort"`.
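
A small sketch of that rule, written as a standalone helper so it can be unit-tested. `SitePortRules` is a hypothetical name; the 8083 default is the `NodeOptions` value cited above:

```csharp
using System.Collections.Generic;

public static class SitePortRules
{
    // NodeOptions default cited in the finding.
    public const int DefaultGrpcPort = 8083;

    // Returns validation errors for a site node; an empty list means the ports are valid.
    public static IReadOnlyList<string> Validate(int remotingPort, int? configuredGrpcPort)
    {
        var errors = new List<string>();

        // Resolve the effective port first so the default takes part in the comparison.
        var grpcPort = configuredGrpcPort ?? DefaultGrpcPort;

        if (grpcPort is < 1 or > 65535)
        {
            errors.Add("ScadaLink:Node:GrpcPort must be in the range 1-65535");
        }

        if (grpcPort == remotingPort)
        {
            errors.Add("ScadaLink:Node:GrpcPort must differ from RemotingPort");
        }

        return errors;
    }
}
```

Resolving the default before comparing also catches the case where `GrpcPort` is absent and `RemotingPort` happens to be 8083.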
**Resolution**
_Unresolved._
### Host-008 — `MachineDataDb` is validated and declared but never consumed
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.Host/StartupValidator.cs:33-34`, `src/ScadaLink.Host/DatabaseOptions.cs:6` |
**Description**
`StartupValidator` requires a non-empty `ScadaLink:Database:MachineDataDb` connection
string for Central nodes, and `DatabaseOptions` exposes a `MachineDataDb` property,
but a repo-wide search shows the value is never read anywhere outside the Host module
— only `ConfigurationDb` is passed to `AddConfigurationDatabase`
(`Program.cs:83-85`). The Host therefore fails startup if `MachineDataDb` is missing
even though nothing uses it. This is either dead configuration that should be removed
or a missing wiring (a machine-data DbContext that was never registered).
**Recommendation**
Determine whether a machine-data store is actually required. If yes, wire it into the
relevant component's DI registration. If no, remove the `MachineDataDb` validation
rule, the `DatabaseOptions` property, and the key from `appsettings.Central.json`.
**Resolution**
_Unresolved._
### Host-009 — `StartAsync` reports success before role actors are confirmed running
| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:127-141` |
**Description**
`StartAsync` creates actors with `ActorOf` (a fire-and-forget operation: the actor's
`PreStart` runs asynchronously on a dispatcher thread) and then returns
`Task.CompletedTask`. For site nodes, `grpcServer.SetReady(_actorSystem)` is called
synchronously at the end of `RegisterSiteActors`, marking the gRPC server ready even
though `SiteCommunicationActor`, the deployment-manager singleton, and the
`ClusterClient` may not yet have completed their `PreStart`/initial-contact handshake.
REQ-HOST-7 requires "Actor system and SiteStreamManager ... initialized before gRPC
begins accepting connections" — `SiteStreamManager.Initialize` is awaited-equivalent,
but the broader actor graph is not. The window is small and the gRPC server still
rejects streams until `SetReady`, so impact is limited, but readiness is being
asserted optimistically.
**Recommendation**
If strict ordering matters, gate `SetReady` on confirmation that
`SiteCommunicationActor` is fully initialized (e.g. an `Ask` round-trip or a
readiness message), or document explicitly that gRPC readiness only guarantees the
actor system exists, not that the cluster handshake has completed.
**Resolution**
_Unresolved._
### Host-010 — No retry/backoff around startup preconditions (DB migration, readiness)
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Host/Program.cs:112-125` |
**Description**
On Central startup the Host opens a DI scope and calls
`MigrationHelper.ApplyOrValidateMigrationsAsync` directly. If the SQL Server is not
yet reachable (common in container orchestration where the DB and app start
together), the call throws, the top-level `catch` logs `Fatal`, and the process
exits. There is no bounded retry/backoff to tolerate a database that is briefly
unavailable at boot. The design intent (REQ-HOST-4a, readiness gating, `503` until
ready) is about *serving traffic*, but the migration step happens before the host
even runs and has no such tolerance.
**Recommendation**
Wrap the migration/validation step in a bounded retry with exponential backoff (e.g.
Polly), or move schema apply behind the readiness gate so the process stays up and
reports `503` until the database becomes reachable.
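
For illustration, a dependency-free sketch of a bounded retry with exponential backoff. `StartupRetry` is a hypothetical helper, not Polly's API, and the attempt count and delays are placeholders to tune:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class StartupRetry
{
    // Runs `operation` up to `maxAttempts` times, doubling the delay after each failure.
    // The final failure, and any cancellation, propagates to the caller.
    public static async Task RunAsync(
        Func<CancellationToken, Task> operation,
        int maxAttempts,
        TimeSpan initialDelay,
        CancellationToken cancellationToken)
    {
        var delay = initialDelay;
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await operation(cancellationToken);
                return;
            }
            catch (Exception ex) when (attempt < maxAttempts && ex is not OperationCanceledException)
            {
                // Wait before the next attempt; Task.Delay observes cancellation.
                await Task.Delay(delay, cancellationToken);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
        }
    }
}
```

The migration call would be passed in as the `operation` delegate. A production version would likely also log each failed attempt and cap the maximum delay.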
**Resolution**
_Unresolved._
### Host-011 — `LoggingOptions.MinimumLevel` is dead configuration
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.Host/LoggingOptions.cs:5`, `src/ScadaLink.Host/Program.cs:42-50` |
**Description**
`LoggingOptions` exposes a `MinimumLevel` property bound from `ScadaLink:Logging`
(`SiteServiceRegistration.BindSharedOptions`), and both `appsettings.Central.json`
and `appsettings.Site.json` set `"Logging": { "MinimumLevel": "Information" }`.
However Serilog is configured purely via `ReadFrom.Configuration(configuration)`,
which reads the standard `Serilog` section — not `ScadaLink:Logging`. The
`LoggingOptions.MinimumLevel` value is never read by any code, so changing it has no
effect. This is misleading: an operator editing `ScadaLink:Logging:MinimumLevel`
expecting a log-level change will see nothing happen.
**Recommendation**
Either consume `LoggingOptions.MinimumLevel` when configuring the Serilog
`LoggerConfiguration` (e.g. set `MinimumLevel.Is(...)` from it), or remove the option
class and the `ScadaLink:Logging` sections and rely solely on the `Serilog`
configuration section. Keep one mechanism, not two.
**Resolution**
_Unresolved._

@@ -0,0 +1,442 @@
# Code Review — InboundAPI
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.InboundAPI` |
| Design doc | `docs/requirements/Component-InboundAPI.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 13 |
## Summary
The InboundAPI module is small (8 source files) and the happy-path flow — extract
key, validate, deserialize parameters, execute script, serialize result — is clean
and readable. However the review surfaced several real problems concentrated in two
themes: **concurrency** and **security**. The `InboundScriptExecutor` is a singleton
that mutates a plain `Dictionary` from concurrent ASP.NET request threads with no
synchronization, which can corrupt the handler cache or crash the process under load.
On the security side, API-key comparison is a non-constant-time database string
match (timing oracle), compiled scripts run with no enforcement of the documented
script trust model (forbidden APIs such as `System.IO`/`Process`/`Reflection` are
fully reachable), there is no request-body size limit, and the executor's catch-all
swallows `OperationCanceledException` from genuine client disconnects as a "timeout".
Design-doc adherence is also incomplete: the `Database.Connection()` script API
described in the design doc is entirely absent from `InboundScriptContext`, and the
endpoint never enforces that the API is central-only. Testing covers the validators
well but there is no coverage of the HTTP endpoint, concurrency, or recompilation.
None of the findings are data-loss-class, but the concurrency and trust-model issues
are High severity and should be addressed before production use.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `CoerceValue` returns `null` for legitimately-null/`String` values indistinguishably; parameter-definition edge cases noted. |
| 2 | Akka.NET conventions | ☑ | Module is ASP.NET-hosted, no actors of its own; routes to actors via `CommunicationService`. No correlation-ID issues — IDs are set in `RouteHelper`. |
| 3 | Concurrency & thread safety | ☑ | Singleton `InboundScriptExecutor` mutates a non-thread-safe `Dictionary` from concurrent request threads — see InboundAPI-001/002. |
| 4 | Error handling & resilience | ☑ | Catch-all conflates client cancellation with timeout (InboundAPI-004); compilation-failure path repeats work on every request (InboundAPI-009). |
| 5 | Security | ☑ | Non-constant-time key comparison, no trust-model enforcement, no body-size limit, missing-method enumeration oracle — see InboundAPI-003/005/006/011. |
| 6 | Performance & resource management | ☑ | Up to 3 separate DB round-trips per request in `ApiKeyValidator`; uncapped lazy recompilation. |
| 7 | Design-document adherence | ☑ | `Database.Connection()` script API missing; central-only hosting not enforced; lazy-compile diverges from "compiled at startup". |
| 8 | Code organization & conventions | ☑ | `ParameterDefinition` is an API-shaped POCO declared in the component project rather than Commons; otherwise conventions followed. |
| 9 | Testing coverage | ☑ | Good unit coverage of the two validators; no endpoint, concurrency, recompilation, or timeout-vs-cancel tests. |
| 10 | Documentation & comments | ☑ | `ApiKeyValidationResult.NotFound` XML/name says "NotFound" but returns HTTP 400 — misleading (InboundAPI-013). |
## Findings
### InboundAPI-001 — Singleton script handler cache mutated without synchronization
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:17`, `:32`, `:40`, `:89`, `:123-128` |
**Description**
`InboundScriptExecutor` is registered as a singleton (`ServiceCollectionExtensions.cs:11`)
and its handler cache is a plain `Dictionary<string, Func<...>>` (`InboundScriptExecutor.cs:17`).
`RegisterHandler`, `RemoveHandler`, `CompileAndRegister`, and the lazy-compile path in
`ExecuteAsync` all read and write this dictionary with no lock. ASP.NET serves inbound
API requests on concurrent thread-pool threads, so two requests for an as-yet-uncompiled
method (or a request racing a CLI-triggered `CompileAndRegister`) can mutate the
dictionary concurrently. `Dictionary` is explicitly not safe for concurrent
read/write — this can corrupt internal buckets, throw `InvalidOperationException`,
or return a torn/`null` handler, crashing the request or the process.
**Recommendation**
Replace the `Dictionary` with a `ConcurrentDictionary<string, Func<...>>`, or guard all
access with a lock. For the lazy-compile path use `GetOrAdd` so concurrent first-callers
compile at most once.
**Resolution**
_Unresolved._
### InboundAPI-002 — Lazy compilation is a check-then-act race with no atomicity
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:123-129` |
**Description**
`ExecuteAsync` does `if (!_scriptHandlers.TryGetValue(...)) { CompileAndRegister(method); handler = _scriptHandlers[method.Name]; }`.
Even setting aside the unsynchronized dictionary (InboundAPI-001), this is a
check-then-act sequence: between `TryGetValue` failing and the re-read on line 128,
another thread could `RemoveHandler` the entry, causing the indexer on line 128 to
throw `KeyNotFoundException` — an unhandled-in-context exception that is then caught
only by the broad catch on line 143 and reported to the caller as "Internal script
error". Multiple concurrent first-callers will also each compile the same script
redundantly (wasted Roslyn work).
**Recommendation**
Make compile-and-fetch a single atomic operation (`ConcurrentDictionary.GetOrAdd`
with a lazily-evaluated factory, or a per-method lock), and have `CompileAndRegister`
return the handler it produced rather than requiring a separate dictionary read.
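
A minimal sketch of the atomic compile-and-fetch recommended here (and of the `ConcurrentDictionary` swap from InboundAPI-001). `HandlerCache`, `GetOrCompile`, and the `compile` delegate are hypothetical names, not types from the module; the real handler signature is whatever `InboundScriptExecutor` uses today.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical stand-in for the executor's handler cache (illustrative names only).
public sealed class HandlerCache<THandler> where THandler : class
{
    // ConcurrentDictionary.GetOrAdd may invoke its value factory on several threads,
    // so the value is wrapped in Lazy<T>: only one Lazy wins, and its factory runs once.
    private readonly ConcurrentDictionary<string, Lazy<THandler>> _handlers =
        new(StringComparer.Ordinal);

    // Returns the cached handler, compiling it on first use.
    public THandler GetOrCompile(string methodName, Func<string, THandler> compile)
    {
        var lazy = _handlers.GetOrAdd(
            methodName,
            name => new Lazy<THandler>(
                () => compile(name),
                LazyThreadSafetyMode.ExecutionAndPublication));
        return lazy.Value;
    }

    // Drops the cached entry so the next call recompiles (for example after an update).
    public bool Remove(string methodName) => _handlers.TryRemove(methodName, out _);
}
```

Because the caller receives the handler directly from `GetOrCompile`, there is no second dictionary read that a concurrent `Remove` could invalidate. Note that `Lazy<T>` in this mode also caches an exception thrown by the factory, which is one possible way to avoid recompiling a broken script on every call.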
**Resolution**
_Unresolved._
### InboundAPI-003 — API key compared with non-constant-time string equality
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ConfigurationDatabase/Repositories/InboundApiRepository.cs:22-23`, consumed by `src/ScadaLink.InboundAPI/ApiKeyValidator.cs:33` |
**Description**
API-key authentication resolves the key with
`FirstOrDefaultAsync(k => k.KeyValue == keyValue)` — an ordinary equality match
translated to a SQL `WHERE KeyValue = @p` comparison. The secret is matched with
ordinary (early-exit) string/SQL comparison rather than a constant-time comparison,
which is a classic timing side-channel for secret material. Combined with the design's
explicit "no rate limiting" decision, an attacker with network access to the central
API can mount a timing attack to recover valid keys. The API key is the *sole*
credential for the inbound API, so this is the primary authentication path.
**Recommendation**
Look the key up by a non-secret indexed identifier (e.g. a key prefix/id) or fetch
candidate rows, then verify the secret in-process using
`CryptographicOperations.FixedTimeEquals` over the UTF-8 bytes. Preferably store only
a salted hash of the key value and compare hashes. Avoid leaking secret-length and
match-position timing.
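
The in-process verification step could look roughly like the sketch below. `ApiKeyHasher` is a hypothetical helper, and the sketch assumes keys are high-entropy random values, so the salt mentioned above is left out for brevity; how the candidate row is located (key id or prefix) is not shown.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper; not part of the reviewed module.
public static class ApiKeyHasher
{
    // Hashing first gives both comparison inputs the same fixed length (32 bytes),
    // so the comparison does not reveal the length of the stored secret.
    public static byte[] Hash(string keyValue) =>
        SHA256.HashData(Encoding.UTF8.GetBytes(keyValue));

    // Compares the presented key with the stored hash without an early exit.
    public static bool Matches(string presentedKey, byte[] storedHash) =>
        CryptographicOperations.FixedTimeEquals(Hash(presentedKey), storedHash);
}
```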
**Resolution**
_Unresolved._
### InboundAPI-004 — Client disconnect is misreported as a script timeout
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:117-141` |
**Description**
`ExecuteAsync` creates a linked CTS from `httpContext.RequestAborted` and the method
timeout, then catches `OperationCanceledException` and unconditionally returns
"Script execution timed out". When the *client* aborts the request (`RequestAborted`
fires), the same exception type is thrown, so a normal client disconnect is logged as
a timeout (`_logger.LogWarning("Script execution timed out ...")`) and an attempt is
made to write a 500 timeout body to an already-gone connection. This pollutes the
failure log (which the design says is reserved for genuine script errors) and obscures
real timeout incidents.
**Recommendation**
Distinguish the two cancellation sources: if `cancellationToken` (the request token)
is cancelled, treat it as a client abort — do not log a timeout and do not attempt to
write a response. Only when the timeout CTS fired should the result be "timed out".
Check `cts.Token.IsCancellationRequested && !cancellationToken.IsCancellationRequested`
or use a dedicated timeout `CancellationTokenSource` so the two are separable.
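
One way to keep the two cancellation sources separable, sketched with hypothetical names (`ScriptOutcome`, `CancellationSketch.RunAsync`); the real `ExecuteAsync` has a different signature and also produces a result body.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public enum ScriptOutcome { Completed, TimedOut, ClientAborted }

// Hypothetical sketch; names do not come from the reviewed module.
public static class CancellationSketch
{
    public static async Task<ScriptOutcome> RunAsync(
        Func<CancellationToken, Task> script,
        TimeSpan timeout,
        CancellationToken requestAborted)
    {
        // A dedicated timeout source, linked with the request token for the script.
        using var timeoutCts = new CancellationTokenSource(timeout);
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(
            requestAborted, timeoutCts.Token);
        try
        {
            await script(linked.Token);
            return ScriptOutcome.Completed;
        }
        catch (OperationCanceledException) when (requestAborted.IsCancellationRequested)
        {
            // Client went away: no timeout log entry, no response body.
            return ScriptOutcome.ClientAborted;
        }
        catch (OperationCanceledException) when (timeoutCts.IsCancellationRequested)
        {
            // Only the timeout source fired: this is a genuine script timeout.
            return ScriptOutcome.TimedOut;
        }
    }
}
```

Checking the request token first means a disconnect that coincides with the timeout is still treated as a client abort, which keeps the failure log conservative.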
**Resolution**
_Unresolved._
### InboundAPI-005 — Compiled API scripts run with no script-trust-model enforcement
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:56-93` |
**Description**
CLAUDE.md's Akka.NET conventions state the script trust model forbids `System.IO`,
`Process`, `Threading`, `Reflection`, and raw network access. `CompileAndRegister`
compiles arbitrary C# with `CSharpScript.Create` and only restricts the *default
imports* (`WithImports("System", ...)`). Imports are a convenience, not a sandbox — a
script can still fully-qualify any type (`System.IO.File.Delete(...)`,
`System.Diagnostics.Process.Start(...)`, `System.Reflection`, raw `Socket`) because
the core framework assemblies are referenced and Roslyn scripting performs no API
allow/deny-listing. Inbound API scripts execute on the central node with the host
process's privileges, so a malicious or buggy method definition has full host access.
Note the Design role authors these scripts (less trusted than Admin), making
enforcement material.
**Recommendation**
Add a compile-time analyzer/`SyntaxWalker` (as the Site Runtime does for instance
scripts) that rejects forbidden namespaces/types before registering a handler, and/or
run scripts under a constrained boundary. At minimum, share the Site Runtime's
forbidden-API checker so the trust model is enforced consistently. Reject the method
(and log) when a violation is found instead of registering it.
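
As a rough illustration of the syntactic check, the sketch below walks a script's syntax tree and reports fully-qualified uses of a deny-list. `ForbiddenApiWalker` and its deny-list are hypothetical; the Site Runtime's actual checker was not examined in this review and should be the shared source of truth. A purely syntactic pass like this can be bypassed (aliases, `global::` prefixes), so a semantic-model check would be more robust.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Hypothetical sketch; not the Site Runtime's checker.
public sealed class ForbiddenApiWalker : CSharpSyntaxWalker
{
    // Illustrative deny-list only; the real list belongs to the trust-model definition.
    private static readonly string[] Forbidden =
    {
        "System.IO", "System.Diagnostics.Process", "System.Reflection", "System.Net",
    };

    public List<string> Violations { get; } = new();

    // Covers `using System.IO;` and type names such as `new System.Diagnostics.Process()`.
    public override void VisitQualifiedName(QualifiedNameSyntax node)
    {
        if (!Report(node)) base.VisitQualifiedName(node);
    }

    // Covers expressions such as `System.IO.File.Delete(...)`.
    public override void VisitMemberAccessExpression(MemberAccessExpressionSyntax node)
    {
        if (!Report(node)) base.VisitMemberAccessExpression(node);
    }

    // Records a violation when the node's text is, or starts with, a forbidden name.
    private bool Report(SyntaxNode node)
    {
        var text = node.ToString();
        var hit = Forbidden.FirstOrDefault(f =>
            text == f || text.StartsWith(f + ".", StringComparison.Ordinal));
        if (hit is null) return false;
        Violations.Add($"{hit} at {node.GetLocation().GetLineSpan().StartLinePosition}");
        return true;
    }

    // Parses the code as a C# script and returns any violations found.
    public static IReadOnlyList<string> Scan(string scriptCode)
    {
        var tree = CSharpSyntaxTree.ParseText(
            scriptCode, new CSharpParseOptions(kind: SourceCodeKind.Script));
        var walker = new ForbiddenApiWalker();
        walker.Visit(tree.GetRoot());
        return walker.Violations;
    }
}
```

`CompileAndRegister` would call something like `Scan` before `CSharpScript.Create` and refuse to register the handler when the result is non-empty.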
**Resolution**
_Unresolved._
### InboundAPI-006 — No request body size limit on the inbound endpoint
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/EndpointExtensions.cs:54-62` |
**Description**
`HandleInboundApiRequest` calls `JsonDocument.ParseAsync(httpContext.Request.Body, ...)`
with no explicit body-size cap and no `[RequestSizeLimit]`/endpoint metadata. Although
Kestrel has a default max request body size, this endpoint accepts arbitrary JSON from
external systems, fully buffers it into a `JsonDocument`, and then `Clone()`s the
root element (`:61`) which materializes the entire document on the heap. With no rate
limiting (a deliberate design choice) a single caller can drive large allocations.
Deep/wide JSON also makes the `CoerceValue` `object`/`list` deserialization
(`ParameterValidator.cs:113,117`) expensive.
**Recommendation**
Set an explicit, modest body-size limit on the endpoint
(`.WithMetadata(new RequestSizeLimitAttribute(...))` or
`IHttpMaxRequestBodySizeFeature`) and consider a `JsonDocumentOptions` `MaxDepth`.
Reject oversized bodies with 413 before buffering.
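
Independently of the ASP.NET-level limit, the read-and-parse step itself can be bounded. The sketch below is framework-free and uses hypothetical names (`BoundedJson.ReadAsync`); the limits passed in are placeholders, not values taken from the design doc.

```csharp
using System;
using System.IO;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper; not part of the reviewed module.
public static class BoundedJson
{
    // Reads at most maxBytes from the stream, then parses with a nesting-depth cap.
    // Returns null when the body is too large (the endpoint would answer 413).
    // Malformed or too-deep JSON surfaces as JsonException from JsonDocument.Parse.
    public static async Task<JsonElement?> ReadAsync(
        Stream body, int maxBytes, int maxDepth, CancellationToken ct = default)
    {
        using var buffer = new MemoryStream();
        var chunk = new byte[8192];
        int read;
        while ((read = await body.ReadAsync(chunk.AsMemory(), ct)) > 0)
        {
            if (buffer.Length + read > maxBytes) return null;
            buffer.Write(chunk, 0, read);
        }

        var options = new JsonDocumentOptions { MaxDepth = maxDepth };
        using var doc = JsonDocument.Parse(new ReadOnlyMemory<byte>(buffer.ToArray()), options);
        // Clone so the element stays valid after the document is disposed.
        return doc.RootElement.Clone();
    }
}
```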
**Resolution**
_Unresolved._
### InboundAPI-007 — `Database.Connection()` script API from the design doc is not implemented
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:155-170` |
**Description**
`Component-InboundAPI.md` ("Script Runtime API -> Database Access") specifies
`Database.Connection("connectionName")` as an available script capability for
querying the configuration/machine-data databases. `InboundScriptContext` exposes only
`Parameters`, `Route`, and `CancellationToken` — there is no `Database` member. Any
method script that follows the documented API will fail to compile. Either the code
is incomplete or the design doc is stale; the two must be reconciled.
**Recommendation**
If database access is in scope, add a `Database` property to `InboundScriptContext`
backed by a connection-factory service. If it is not, remove the "Database Access"
section from `Component-InboundAPI.md` so the design doc stops advertising an absent
API.
**Resolution**
_Unresolved._
### InboundAPI-008 — Inbound API endpoint not restricted to the active central node
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/EndpointExtensions.cs:19-23`, `src/ScadaLink.Host/Program.cs:149` |
**Description**
The design states the Inbound API is "Central cluster only (active node)" and "fails
over with it". `MapInboundAPI` registers `POST /api/{methodName}` unconditionally, and
`Program.cs` maps it inside the central-role branch but with no active-node gating —
unlike `/health/active` which has an `active-node` predicate. A standby central node
will happily serve inbound API calls, executing scripts and `Route.To()` calls from a
non-leader, which can race the active node or run against stale singleton state.
**Recommendation**
Gate the endpoint on active-node status (reuse the cluster `active-node` health check
or a leader-state check) and return 503 on the standby, so Traefik/clients only reach
the live node — consistent with how the Management API and `/health/active` are
treated.
**Resolution**
_Unresolved._
### InboundAPI-009 — Failed compilation is retried on every subsequent request
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:123-128` |
**Description**
When a method's script fails to compile, `CompileAndRegister` returns `false` and
nothing is stored in `_scriptHandlers`. Every subsequent call to that method re-enters
the lazy-compile branch and recompiles the broken script via Roslyn from scratch.
Roslyn compilation is expensive; a single broken method definition repeatedly invoked
by an external caller (no rate limiting) becomes a CPU amplification vector.
**Recommendation**
Cache the compilation *failure* (e.g. store a sentinel handler that immediately
returns the compile error, or keep a `HashSet` of known-bad method names with the
diagnostic) so a broken script is compiled at most once until the definition is
updated via `CompileAndRegister`.
**Resolution**
_Unresolved._
### InboundAPI-010 — `ParameterValidator` ignores extra body fields and cannot validate Object/List element types
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/ParameterValidator.cs:64-90`, `:112-118` |
**Description**
Two related correctness gaps: (1) The validator iterates only over *defined*
parameters; any extra top-level fields in the request body are silently ignored
rather than reported, so callers get no feedback on typo'd parameter names. (2) For
`Object` and `List` types the validator only checks the JSON *kind* (`Object`/`Array`)
and then blindly `JsonSerializer.Deserialize`s the raw text — the design's extended
type system describes Objects as "named structure with typed fields" and Lists as
collections "of objects or primitive types", but no field-level or element-level type
validation is performed. Invalid nested structures pass validation and surface only
as runtime script errors.
**Recommendation**
Optionally warn/400 on unexpected body fields. For the extended types, either parse a
richer `ParameterDefinition` (with nested field definitions / element type) and
validate recursively, or document explicitly that Object/List are validated only for
shape — and update the design doc to match.
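
The unexpected-field check in (1) is cheap to add. A sketch, assuming parameter names are matched case-sensitively (the design doc should confirm this); `UnexpectedFieldCheck` is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical helper; not part of the reviewed module.
public static class UnexpectedFieldCheck
{
    // Returns the top-level property names in the body that no parameter definition covers.
    public static IReadOnlyList<string> Find(JsonElement body, IEnumerable<string> definedNames)
    {
        if (body.ValueKind != JsonValueKind.Object) return Array.Empty<string>();

        var known = new HashSet<string>(definedNames, StringComparer.Ordinal);
        return body.EnumerateObject()
            .Select(property => property.Name)
            .Where(name => !known.Contains(name))
            .ToList();
    }
}
```

The validator could then either fail the request with a 400 that lists the unknown names, or log them as a warning, depending on how strict the contract is meant to be.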
**Resolution**
_Unresolved._
### InboundAPI-011 — Method-existence check leaks to unapproved callers (enumeration oracle)
| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/ApiKeyValidator.cs:39-52` |
**Description**
`ValidateAsync` returns 400 `Method '{methodName}' not found` when the method does not
exist, but 403 `API key not approved for this method` when it exists but the key is
not approved. A caller holding any valid enabled key can therefore enumerate which
method names exist on the central API by observing 400-vs-403 responses. The error
message also echoes the caller-supplied `methodName` back verbatim into the JSON
response (`EndpointExtensions.cs:47`), a minor reflected-input concern.
**Recommendation**
Return an indistinguishable response (e.g. 403/404) for both "method not found" and
"key not approved" so existence is not observable to unapproved callers. Avoid echoing
raw caller input in error bodies, or sanitize it.
**Resolution**
_Unresolved._
### InboundAPI-012 — `ParameterDefinition` POCO declared in the component project, not Commons
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/ParameterValidator.cs:128-133` |
**Description**
`ParameterDefinition` is a persistence-/contract-shaped POCO: it is the deserialized
form of `ApiMethod.ParameterDefinitions` (a column in the configuration database) and
describes the public API contract. CLAUDE.md's code-organization rules place
persistence-ignorant entity/contract types in `ScadaLink.Commons`. Defining it inside
the InboundAPI project means any other component that needs to read or produce method
parameter definitions (e.g. Central UI's method editor, CLI, Management Service)
cannot share the type and will duplicate it.
**Recommendation**
Move `ParameterDefinition` (and a matching return-definition type, if added) to
`ScadaLink.Commons` under the InboundApi entity/types namespace so it is shared by all
components that work with method definitions.
**Resolution**
_Unresolved._
### InboundAPI-013 — `ApiKeyValidationResult.NotFound` factory returns HTTP 400, contradicting its name
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/ApiKeyValidator.cs:78-79` |
**Description**
The static factory is named `NotFound` and is used for the "method not found" case,
but it builds a result with `StatusCode = 400` (Bad Request), not 404. The name
strongly implies 404 and will mislead future maintainers; `EndpointExtensions`
faithfully propagates whatever status code the factory sets, so the misnaming directly
affects the wire contract.
**Recommendation**
Rename the factory to match its behaviour (e.g. `BadRequest`) or change the status
code to 404 if that is the intended contract — and document the chosen "method not
found" status in `Component-InboundAPI.md`'s Error Handling section, which currently
does not list it.
**Resolution**
_Unresolved._

@@ -0,0 +1,432 @@
# Code Review — ManagementService
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.ManagementService` |
| Design doc | `docs/requirements/Component-ManagementService.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 13 |
## Summary
The ManagementService module is a thin command-dispatch layer: a single `ManagementActor`
fronts every administrative operation, an HTTP `POST /management` endpoint authenticates and
forwards to it, and a SignalR `DebugStreamHub` provides real-time debug streaming. The code
is consistently structured and the role-based authorization gate (`GetRequiredRole`) is
broadly correct and well tested. However, the review surfaced a significant **security
theme**: site-scope enforcement, which the design document requires for instance- and
site-targeted Deployment operations, is applied inconsistently — several query handlers and
all remote-query/debug handlers perform no site-scope check at all, allowing a site-scoped
Deployment user to read or act on sites outside their scope. A second theme is **Akka.NET
convention drift**: the actor offloads all work to `Task.Run` instead of using `PipeTo`,
declares no supervision strategy, and the contract messages carry a loosely-typed `object`
payload. There are also resource-management defects in the HTTP endpoint (`JsonDocument`
instances never disposed) and dead/unused configuration. None of the findings are
crash-class, but the site-scope gaps are High severity because they are a real
authorization bypass with no workaround.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `HandleResolveRoles` builds `RoleMapper` by hand; `ResolveRolesCommand` is a stale dispatch path. See 008, 011. |
| 2 | Akka.NET conventions | ☑ | `Task.Run` instead of `PipeTo`, no supervision strategy, `object`-typed message payload. See 004, 005, 012. |
| 3 | Concurrency & thread safety | ☑ | Actor is stateless so `Task.Run` does not corrupt state, but it defeats actor-thread serialization (004). `Sender` correctly captured to a local before the closure. |
| 4 | Error handling & resilience | ☑ | Exceptions are caught and mapped uniformly; `SiteScopeViolationException` mapped to `Unauthorized`. Audit-logging consistency issue noted in 009. |
| 5 | Security | ☑ | Site-scope enforcement missing on query/remote/debug paths. See 001, 002, 003. |
| 6 | Performance & resource management | ☑ | `JsonDocument` instances never disposed in the HTTP endpoint. See 006. |
| 7 | Design-document adherence | ☑ | Design doc states remote queries enforce site scoping; code does not. `ManagementServiceOptions` reserved-for-future config is unused. See 001, 010. |
| 8 | Code organization & conventions | ☑ | Mixed serializers (Newtonsoft in actor, System.Text.Json in endpoint); inconsistent audit logging across mutations. See 007, 009. |
| 9 | Testing coverage | ☑ | Authorization is well covered; site-scope enforcement, the HTTP endpoint, `DebugStreamHub`, and remote-query handlers have no tests. See 013. |
| 10 | Documentation & comments | ☑ | XML docs are accurate where present; `ManagementServiceOptions` and `ResolveRolesCommand` paths are undocumented dead code (010, 011). |
## Findings
### ManagementService-001 — Remote-query and debug-snapshot handlers bypass site-scope enforcement
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:1465`, `:1481`, `:1493`, `:641`, `:649` |
**Description**
The design document (`Component-ManagementService.md`, Authorization section) states that for
Deployment users "Site scoping is enforced for site-scoped Deployment users" and lists
"debug snapshot, parked message queries, site event log queries" among the Deployment-role
operations. `HandleQueryEventLogs`, `HandleQueryParkedMessages`, `HandleDebugSnapshot`,
`HandleRetryParkedMessage`, and `HandleDiscardParkedMessage` make no call to `EnforceSiteScope`
or `EnforceSiteScopeForInstance`. A Deployment user scoped to site A can therefore query event
logs / parked messages of site B, retry or discard another site's parked messages, and pull a
debug snapshot of any instance simply by supplying a different `SiteIdentifier` or `InstanceId`.
This is an authorization bypass with no workaround.
**Recommendation**
In each of these handlers resolve the target site and call site-scope enforcement before
delegating to `CommunicationService`. For the `SiteIdentifier`-keyed handlers, look up the
`Site` by identifier and enforce against `Site.Id`; for `DebugSnapshotCommand` the instance
is already loaded — call `EnforceSiteScope(user, instance.SiteId)` (which requires threading
`AuthenticatedUser` into these handlers, currently dropped).
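
The shape of the missing check is small. The sketch below uses hypothetical stand-ins (`ScopedUser`, a local `SiteScopeViolationException`, `SiteScope.Enforce`) because the actual `AuthenticatedUser` and `EnforceSiteScope` signatures are not reproduced in this review; it also assumes that a `null` permitted-site set means the user is not site-scoped.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the module's user and exception types.
public sealed record ScopedUser(string Name, IReadOnlySet<int>? PermittedSiteIds);

public sealed class SiteScopeViolationException : Exception
{
    public SiteScopeViolationException(string message) : base(message) { }
}

public static class SiteScope
{
    // Throws when a site-scoped user targets a site outside their permitted set.
    public static void Enforce(ScopedUser user, int siteId)
    {
        // Assumption: null means "not site-scoped" (system-wide access).
        if (user.PermittedSiteIds is null) return;

        if (!user.PermittedSiteIds.Contains(siteId))
            throw new SiteScopeViolationException(
                $"User '{user.Name}' is not permitted to access site {siteId}.");
    }
}
```

Each `SiteIdentifier`-keyed handler would resolve the identifier to a site ID first and call the check before any call into `CommunicationService`, so that a violation surfaces through the existing exception-to-`Unauthorized` mapping.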
**Resolution**
_Unresolved._
### ManagementService-002 — Single-entity query handlers leak data across site scope
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:510`, `:673`, `:733`, `:774`, `:631`, `:624` |
**Description**
`HandleListInstances` and `HandleListSites` correctly filter their results by the user's
`PermittedSiteIds`, but the single-entity query handlers do not. `HandleGetInstance`,
`HandleGetSite`, `HandleListAreas`, and `HandleGetDataConnection` fetch by ID with no
site-scope check, so a site-scoped Deployment user can read any instance, site, area tree,
or data connection by ID even though that site is excluded from their scope. The list
endpoints having a filter while the get-by-id endpoints do not is an inconsistency that
undermines the scoping model. (`HandleGetDeploymentDiff` and `HandleListInstanceAlarmOverrides`
do enforce scope, confirming the omission elsewhere is unintentional.)
**Recommendation**
Apply `EnforceSiteScopeForInstance` in `HandleGetInstance`, and `EnforceSiteScope` against
the resolved site ID in `HandleGetSite`, `HandleListAreas`, and `HandleGetDataConnection`
(for data connections, scope by the connection's `SiteId`).
**Resolution**
_Unresolved._
### ManagementService-003 — DebugStreamHub.SubscribeInstance performs no per-instance authorization
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/DebugStreamHub.cs:104` |
**Description**
`OnConnectedAsync` authenticates the WebSocket connection and verifies the caller holds the
`Deployment` role, but `SubscribeInstance(int instanceId)` accepts any instance ID and starts
a stream without checking that the authenticated user is scoped to that instance's site. A
site-scoped Deployment user can therefore subscribe to the live debug stream (attribute
values, alarm states) of an instance belonging to a site outside their scope. This is the
streaming equivalent of findings 001 and 002.
**Recommendation**
Resolve the instance's site inside `SubscribeInstance` and reject the subscription if the
authenticated user's permitted-site set does not include it. The authenticated identity
established in `OnConnectedAsync` must be persisted on the connection (e.g. in
`Context.Items`) so it is available to `SubscribeInstance`.
**Resolution**
_Unresolved._
### ManagementService-004 — Actor offloads work to Task.Run instead of using PipeTo
| | |
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:61` |
**Description**
`HandleEnvelope` runs every command on a thread-pool thread via `Task.Run(async () => ...)`
and replies from inside the continuation. This is the anti-pattern the project's Akka.NET
conventions warn against — the canonical approach is to start the async work and `PipeTo`
its result back to `Self`/`Sender`. Although `Sender` is correctly copied to a local before
the closure, the current code: (a) lets multiple commands execute fully concurrently with no
actor-thread serialization, so the actor provides no ordering or back-pressure guarantees
and is an actor in name only; (b) cannot be paused, supervised, or made to honour a mailbox
bound; (c) is shielded from synchronous faults only because every path is inside the
try/catch — any future code path that throws synchronously before the `Task.Run` body would
escape it.
**Recommendation**
Replace `Task.Run` with a method that returns the `Task` and `PipeTo` the mapped result
(`ManagementSuccess`/`ManagementError`/`ManagementUnauthorized`) back to the captured sender,
mapping faults in the `PipeTo` failure continuation. If genuine parallelism is desired, make
that explicit with a router/dispatcher rather than ad-hoc `Task.Run`.
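
A minimal sketch of the `PipeTo` shape, with hypothetical message and actor types (`Command`, `Success`, `Error`, `PipeToSketchActor`) standing in for the real management messages. Note that piping to the sender alone does not serialize command processing, since the receive handler still returns before the task completes; it removes the ad-hoc `Task.Run` and moves fault mapping into the `failure` continuation.

```csharp
using System;
using System.Threading.Tasks;
using Akka.Actor;

// Hypothetical message types; the real ones live in ScadaLink.Commons.
public sealed record Command(string Name);
public sealed record Success(string Json);
public sealed record Error(string Message);

public sealed class PipeToSketchActor : ReceiveActor
{
    public PipeToSketchActor(Func<Command, Task<string>> dispatch)
    {
        Receive<Command>(cmd =>
        {
            // Capture Sender before the async work starts; it is not valid later.
            var replyTo = Sender;

            // Start the work and pipe its outcome to the caller as a message.
            dispatch(cmd).PipeTo(
                replyTo,
                success: json => new Success(json),
                failure: ex => new Error(ex.Message));
        });
    }
}
```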
**Resolution**
_Unresolved._
### ManagementService-005 — ManagementActor declares no supervision strategy
| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:33` |
**Description**
The project conventions call for explicit supervision strategies (Resume for coordinator
actors). `ManagementActor` is a long-lived coordinator-style actor but overrides no
`SupervisorStrategy` and defines no `PreRestart`/`PostRestart` behaviour. In practice it
spawns no children so the default strategy is rarely exercised, but an explicit strategy
should still be declared for clarity and to match the documented convention; it also matters
if children are added later (e.g. if finding 004 introduces worker actors).
**Recommendation**
Add an explicit `protected override SupervisorStrategy SupervisorStrategy()` returning a
Resume-based strategy, consistent with other central coordinator actors.
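
For reference, a Resume-based override is a one-liner. `CoordinatorSketchActor` is a hypothetical actor used only to show the override; whether other central coordinators use `OneForOneStrategy` or a different decider was not checked here.

```csharp
using System;
using Akka.Actor;

// Hypothetical actor; shows only the supervision override.
public sealed class CoordinatorSketchActor : ReceiveActor
{
    // Resume any failing child: the child keeps its state and continues with the next message.
    protected override SupervisorStrategy SupervisorStrategy() =>
        new OneForOneStrategy(ex => Directive.Resume);
}
```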
**Resolution**
_Unresolved._
### ManagementService-006 — JsonDocument instances never disposed in the HTTP endpoint
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementEndpoints.cs:83`, `:112` |
**Description**
`JsonDocument` is `IDisposable`: it rents its backing buffers from a shared `ArrayPool<byte>` and
returns them only on `Dispose`. `HandleRequest` parses the request body into `doc` at line 83 and
never disposes it, and line 112 (`JsonDocument.Parse("{}")`) allocates a second document inline
that is also never disposed. Every management HTTP call therefore fails to return its rented
buffers to the pool, increasing GC pressure and pool churn under load.
**Recommendation**
Wrap the parsed document in `using var doc = ...`. For the empty-payload fallback, avoid
allocating a `JsonDocument` entirely — deserialize from the literal string `"{}"`/an empty
object, or restructure so the fallback path does not parse a throwaway document.
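
A sketch of the disposal pattern, assuming the payload sits under a `payload` property of the request body (an assumption for illustration; the real envelope shape is defined in `ManagementEndpoints`). `PayloadReader` and `SamplePayload` are hypothetical.

```csharp
using System.Text.Json;

// Hypothetical payload type used only for illustration.
public sealed class SamplePayload
{
    public string? Name { get; set; }
}

public static class PayloadReader
{
    // Parses the body, deserializes the payload, and disposes the document on exit.
    public static T? ReadPayload<T>(string requestBody)
    {
        using var doc = JsonDocument.Parse(requestBody);

        if (doc.RootElement.TryGetProperty("payload", out var payload))
            return payload.Deserialize<T>();

        // Empty-payload fallback: deserialize from the literal, no throwaway JsonDocument.
        return JsonSerializer.Deserialize<T>("{}");
    }
}
```

The deserialized object does not reference the document's buffers, so it remains valid after `doc` is disposed.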
**Resolution**
_Unresolved._
### ManagementService-007 — Inconsistent and cycle-prone serialization of repository entities
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:67`; `src/ScadaLink.ManagementService/ManagementEndpoints.cs:113` |
**Description**
The actor serializes every command result with `Newtonsoft.Json` (`JsonConvert.SerializeObject`)
while the HTTP endpoint deserializes payloads with `System.Text.Json`. Beyond the
inconsistency, `JsonConvert.SerializeObject` is applied directly to EF-backed entities
returned by repositories (e.g. `Site`, `DataConnection`, `NotificationList` with a
`Recipients` collection, `Template` with children). With default Newtonsoft settings any
bidirectional navigation property produces a `JsonSerializationException` for self-referencing
loops, and even without cycles this serializes lazy/navigation state the CLI does not expect.
**Recommendation**
Standardise on one serializer (the rest of the HTTP path uses `System.Text.Json`). Serialize
explicit DTOs / projections rather than EF entities, or configure
`ReferenceLoopHandling.Ignore` and ignore navigation properties. Verify that handlers
returning rich entity graphs (`HandleGetTemplate`, `HandleUpdateNotificationList`) round-trip
correctly.
**Resolution**
_Unresolved._
### ManagementService-008 — HandleResolveRoles constructs RoleMapper manually instead of via DI
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:285` |
**Description**
Every other handler resolves its collaborators from the scoped `IServiceProvider`.
`HandleResolveRoles` instead does `new RoleMapper(sp.GetRequiredService<ISecurityRepository>())`,
bypassing DI. If `RoleMapper` ever gains a dependency, caching, or options, this hand-built
instance silently diverges from the DI-registered one. It is also inconsistent with
`ManagementEndpoints`, which resolves `RoleMapper` from DI.
**Recommendation**
Resolve `RoleMapper` via `sp.GetRequiredService<RoleMapper>()` like every other dependency.
**Resolution**
_Unresolved._
### ManagementService-009 — Audit logging applied inconsistently across mutating handlers
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:357`, `:1134`, `:1085`, `:526`, `:1275` |
**Description**
The design doc states "All mutating operations are audit logged." Some handlers call
`AuditAsync` explicitly (`HandleCreateInstance`, `HandleCreateSite`, all repository-direct
external-system/notification/security/area mutations), but the handlers that delegate to a
domain service do **not**: `HandleCreateTemplate`/`HandleUpdateTemplate`/`HandleDeleteTemplate`,
all template-member handlers (`HandleAddAttribute` ... `HandleDeleteComposition`), template-folder
handlers, shared-script handlers, `HandleDeployArtifacts`, `HandleDeployInstance`,
`HandleEnableInstance`/`Disable`/`Delete`, and the instance-binding/override handlers. This is
correct only if every one of those services performs its own audit logging internally; the
mixed pattern makes that impossible to verify by reading this module and creates a real risk
of silent audit gaps for template authoring and deployment operations.
**Recommendation**
Decide on one layer that owns auditing. Either route all mutations through services that audit
internally (and remove the explicit `AuditAsync` calls here), or audit uniformly in the actor
after every successful mutation. Document the chosen contract so the inconsistency cannot
recur, and confirm template/deployment services actually audit.
**Resolution**
_Unresolved._
### ManagementService-010 — ManagementServiceOptions.CommandTimeout is defined but never used
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementServiceOptions.cs:5`; `src/ScadaLink.ManagementService/ManagementEndpoints.cs:16` |
**Description**
`ManagementServiceOptions.CommandTimeout` is bound from configuration in
`ServiceCollectionExtensions`, but no code reads it. The HTTP endpoint instead hard-codes
`AskTimeout = TimeSpan.FromSeconds(30)`. The design doc describes the options section as
"Reserved for future configuration — e.g., command timeout overrides", yet a concrete
`CommandTimeout` property already exists and is silently ignored, so an operator who sets it
in `appsettings.json` gets no effect.
**Recommendation**
Either consume `ManagementServiceOptions.CommandTimeout` in `ManagementEndpoints.HandleRequest`
(inject `IOptions<ManagementServiceOptions>`), or remove the property until it is wired up so
configuration cannot be set with no effect.
**Resolution**
_Unresolved._
### ManagementService-011 — ResolveRolesCommand dispatch path is stale dead code
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:273`, `:283` |
**Description**
The design doc states the HTTP endpoint "collapses the CLI's previous two-step flow
(ResolveRoles + actual command) into a single HTTP round-trip", and indeed `ManagementEndpoints`
performs LDAP auth and role resolution itself before dispatching. The `ResolveRolesCommand`
case in `DispatchCommand` is therefore unreachable from the HTTP path. It remains reachable
only via a raw ClusterClient sender, but a caller able to send `ResolveRolesCommand` could
enumerate role mappings for arbitrary LDAP groups with no role requirement
(`GetRequiredRole` returns null for it) — a minor information-disclosure surface for a path
the design says no longer exists.
**Recommendation**
If the two-step flow is genuinely retired, remove `ResolveRolesCommand`, its handler, and the
class. If it must remain for non-HTTP clients, document why and confirm that exposing role-mapping
data to callers with no role requirement is intended.
**Resolution**
_Unresolved._
### ManagementService-012 — ManagementEnvelope carries a loosely-typed object payload
| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.Commons/Messages/Management/ManagementEnvelope.cs:7`; `src/ScadaLink.ManagementService/ManagementActor.cs:132` |
**Description**
`ManagementEnvelope.Command` is typed `object`, so the actor relies on a large open-ended
`switch` with a `NotSupportedException` default for unknown types. While the individual
command records are immutable, `object` defeats compile-time exhaustiveness — adding a new
command record produces no compiler signal that `DispatchCommand` (and `GetRequiredRole`)
need updating, and a typo or unregistered command surfaces only as a runtime exception. The
message contract is also harder to evolve safely under the additive-only rule.
**Recommendation**
Introduce a marker interface (e.g. `IManagementCommand`) implemented by every command record
and type the envelope payload as that interface. This documents the contract, lets analyzers
flag unhandled cases, and keeps `ManagementCommandRegistry`'s reflection scan precise.
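A minimal sketch of the marker-interface approach follows. The command records (`CreateSiteCommand`, `DeleteSiteCommand`), the simplified envelope shape, and the `Dispatcher` helper are hypothetical stand-ins; only the `IManagementCommand` name comes from the recommendation above.

```csharp
using System;

// Usage: only IManagementCommand implementations can be wrapped in the envelope.
var envelope = new ManagementEnvelope(new CreateSiteCommand("Plant-A"), "corr-1");
Console.WriteLine(Dispatcher.Describe(envelope.Command));

// Marker interface: no members, it only narrows what the envelope may carry.
public interface IManagementCommand { }

// Illustrative command records (hypothetical names and fields).
public sealed record CreateSiteCommand(string Name) : IManagementCommand;
public sealed record DeleteSiteCommand(int SiteId) : IManagementCommand;

// Simplified envelope: the payload is IManagementCommand instead of object.
public sealed record ManagementEnvelope(IManagementCommand Command, string CorrelationId);

public static class Dispatcher
{
    // A default arm is still required, but arbitrary objects can no longer reach it.
    public static string Describe(IManagementCommand command) => command switch
    {
        CreateSiteCommand c => $"create site '{c.Name}'",
        DeleteSiteCommand d => $"delete site {d.SiteId}",
        _ => throw new NotSupportedException($"Unhandled command: {command.GetType().Name}")
    };
}
```

With this shape, `new ManagementEnvelope("not a command", "corr-2")` fails at compile time rather than surfacing as a runtime `NotSupportedException`.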
**Resolution**
_Unresolved._
### ManagementService-013 — No tests for site-scope enforcement, the HTTP endpoint, or DebugStreamHub
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.ManagementService.Tests/ManagementActorTests.cs:1` |
**Description**
`ManagementActorTests` covers role-based authorization, success/error mapping, and correlation
IDs thoroughly, but several critical paths are untested: (a) site-scope enforcement —
`EnforceSiteScope`/`EnforceSiteScopeForInstance` and `SiteScopeViolationException` -> `Unauthorized`
mapping have no test, which is why the gaps in findings 001/002 went unnoticed; (b)
`ManagementEndpoints` — Basic Auth decoding, malformed-header handling, LDAP/role resolution,
command deserialization, and HTTP status mapping have zero coverage; (c) `DebugStreamHub`
authentication, subscribe/unsubscribe lifecycle, and `ManagementCommandRegistry.Resolve` are
untested. The `Envelope` test helper always passes `Array.Empty<string>()` for permitted
sites, so no test ever exercises a site-scoped user.
**Recommendation**
Add tests that exercise a site-scoped Deployment user against in-scope and out-of-scope
targets for instance and site operations, asserting `ManagementUnauthorized` on violations.
Add `WebApplicationFactory`-based tests for `ManagementEndpoints` covering auth failures,
malformed bodies, unknown commands, and the 200/400/403/401/504 mappings.
**Resolution**
_Unresolved._


@@ -0,0 +1,306 @@
# Code Review — NotificationService
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.NotificationService` |
| Design doc | `docs/requirements/Component-NotificationService.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 12 |
## Summary
The NotificationService module is small (6 source files) and structurally clean: it
abstracts the SMTP client behind an interface, isolates the OAuth2 token lifecycle,
and integrates with the Store-and-Forward Engine for transient-failure buffering.
However, the review surfaced several substantive defects. The most serious is that
**no Store-and-Forward delivery handler is ever registered for the `Notification`
category** — buffered notifications are persisted but never retried or delivered,
silently losing every notification that hit a transient SMTP failure. Error
classification is fragile (substring matching on exception messages) and is
applied inconsistently between `SendAsync` and `DeliverAsync`. `DeliverAsync` also
contains a resource-management bug that constructs and leaks two SMTP clients per
call. Secondary themes: the `OAuth2TokenService` singleton caches a single token
keyed to no credential identity (incorrect if multiple SMTP configs exist), several
design-doc requirements are unimplemented (connection timeout, max concurrent
connections, TLS `SSL`/`None` modes), and credentials are stored and passed as
plaintext `string` values. Test coverage exercises the happy path and the main
error branches but misses the OAuth2 delivery path, the permanent-classification
fallback in `DeliverAsync`, and concurrency on the token cache.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Double SMTP client construction; `Auto` socket option for non-TLS; `TimeoutException`/`OperationCanceledException` misclassified. |
| 2 | Akka.NET conventions | ☑ | No actors in this module (`AddNotificationServiceActors` is a no-op); delivery is a plain DI service. No Akka-specific issues. |
| 3 | Concurrency & thread safety | ☑ | `OAuth2TokenService` is a singleton with a shared mutable token cache; double-checked locking present but cache key is wrong (NS-006). |
| 4 | Error handling & resilience | ☑ | Critical: no S&F delivery handler registered for `Notification` (NS-001). Fragile substring error classification (NS-002, NS-003). |
| 5 | Security | ☑ | Credentials handled as plaintext strings; OAuth2 client secret in DB credential blob; no recipient address validation. |
| 6 | Performance & resource management | ☑ | Two `ISmtpClientWrapper` instances created per send, one leaked; connection not pooled; `MaxConcurrentConnections` unenforced. |
| 7 | Design-document adherence | ☑ | Connection timeout, max concurrent connections, and TLS `SSL`/`None` modes from the design doc are not implemented. |
| 8 | Code organization & conventions | ☑ | `SmtpPermanentException` in the wrong file; `SmtpConfiguration` POCO has non-nullable strings with no initializer (compiler-warning risk). |
| 9 | Testing coverage | ☑ | Happy path and main error branches covered; OAuth2 delivery path, `DeliverAsync` permanent fallback, and token-cache concurrency untested. |
| 10 | Documentation & comments | ☑ | XML comment on `DeliverAsync` ("Throws on failure") and the misleading "OAuth2 token refresh if needed" comment do not match behaviour. |
## Findings
### NotificationService-001 — Buffered notifications are never retried (no S&F delivery handler)
| | |
|--|--|
| Severity | Critical |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:96`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:8` |
**Description**
On a transient SMTP failure the service calls `_storeAndForward.EnqueueAsync(StoreAndForwardCategory.Notification, ...)`. The Store-and-Forward Engine only delivers (immediately or on retry sweep) a category for which a delivery handler has been registered via `StoreAndForwardService.RegisterDeliveryHandler`. A repo-wide search shows the `Notification` category handler is never registered anywhere — `StoreAndForwardCategory.Notification` appears only in this module's `EnqueueAsync` call. As a result, every buffered notification falls into the `RetryMessageAsync` "No delivery handler for category" branch (`StoreAndForwardService.cs:201-204`), which logs a warning and returns without ever delivering or removing the message. Buffered notifications accumulate in SQLite forever and are never sent. This silently loses every notification that hit a transient failure, while `SendAsync` returns `Success=true, WasBuffered=true`, telling the caller the notification is safely queued. This directly violates the design doc's "integrates with the Store-and-Forward Engine for reliable delivery" guarantee.
**Recommendation**
Register a delivery handler for `StoreAndForwardCategory.Notification` during startup that deserializes the buffered payload (`ListName`, `Subject`, `Message`), re-resolves the list/recipients/SMTP config, and re-attempts `DeliverAsync`, returning `true` on success, `false` on permanent failure, and throwing on transient failure. Wire it in `AddNotificationService` or the host bootstrap. Add an integration test covering the buffer-then-retry-then-deliver round trip.
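The handler contract described above (return `true` on success, `false` on permanent failure, throw on transient failure) could look roughly like the sketch below. The signature of `RegisterDeliveryHandler` is not shown in this review, so the handler shape, the `BufferedNotification` record, and `PermanentDeliveryException` are assumptions made for illustration.

```csharp
using System;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical handler: true = delivered, false = permanent failure, throw = retry later.
static async Task<bool> HandleBufferedNotificationAsync(
    string payloadJson, Func<BufferedNotification, Task> deliver)
{
    BufferedNotification? payload;
    try
    {
        payload = JsonSerializer.Deserialize<BufferedNotification>(payloadJson);
    }
    catch (JsonException)
    {
        return false; // an undecodable payload can never succeed
    }
    if (payload is null) return false;

    try
    {
        await deliver(payload);
        return true;
    }
    catch (PermanentDeliveryException)
    {
        return false;
    }
    // Any other exception propagates so the engine can schedule another attempt.
}

var json = JsonSerializer.Serialize(new BufferedNotification("ops", "Alarm", "Pump 3 tripped"));
var delivered = await HandleBufferedNotificationAsync(json, _ => Task.CompletedTask);
Console.WriteLine($"delivered={delivered}");

// Payload fields named in the recommendation above.
public sealed record BufferedNotification(string ListName, string Subject, string Message);

// Stand-in for the module's permanent-failure exception type.
public sealed class PermanentDeliveryException : Exception
{
    public PermanentDeliveryException(string message) : base(message) { }
}
```

The `deliver` delegate stands in for re-resolving the list, recipients, and SMTP configuration and then calling `DeliverAsync`.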
**Resolution**
_Unresolved._
### NotificationService-002 — `TimeoutException`/`OperationCanceledException` misclassified as transient
| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:157-167` |
**Description**
`IsTransientSmtpError` treats `OperationCanceledException` (and its subtype `TaskCanceledException`) as a transient SMTP error. When the caller passes a `CancellationToken` that is cancelled — e.g. the Script Execution Actor is stopped, or the script times out — the resulting `OperationCanceledException` is caught by the `catch ... when (IsTransientSmtpError(ex))` clause and the notification is buffered as if SMTP had failed. A deliberate cancellation should propagate, not be silently buffered for retry. The same clause classifies any `IOException` as transient even though `IOException` covers unrelated failures (e.g. a serialization stream error). Additionally, `OperationCanceledException` raised by token cancellation in the OAuth2 path would be miscategorised the same way.
**Recommendation**
Re-throw `OperationCanceledException`/`TaskCanceledException` when `cancellationToken.IsCancellationRequested` is true rather than classifying it as transient. Narrow `IOException` handling to SMTP-specific I/O failures, or rely on MailKit's typed exceptions (`SmtpCommandException`, `SmtpProtocolException`, `ServiceNotConnectedException`) instead of broad base types.
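One way to express this is a classifier that receives the caller's token, used as an exception filter. This is a sketch under assumptions: `IsTransient` and `SendOrBufferAsync` are hypothetical helpers, and treating `TimeoutException`/`SocketException` as retryable is a choice made for the example, not a statement about the module's final policy.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Cancellation requested by the caller is never transient: it must propagate.
static bool IsTransient(Exception ex, CancellationToken callerToken) => ex switch
{
    OperationCanceledException when callerToken.IsCancellationRequested => false,
    // Assumption for this sketch: timeouts and socket failures are retryable.
    TimeoutException or SocketException => true,
    _ => false
};

// Hypothetical wrapper showing the filter in context.
static async Task<string> SendOrBufferAsync(
    Func<CancellationToken, Task> deliver, CancellationToken ct)
{
    try
    {
        await deliver(ct);
        return "sent";
    }
    catch (Exception ex) when (IsTransient(ex, ct))
    {
        return "buffered"; // the real service would enqueue to Store-and-Forward here
    }
}

using var cts = new CancellationTokenSource();
cts.Cancel();
try
{
    await SendOrBufferAsync(token => Task.FromCanceled(token), cts.Token);
}
catch (OperationCanceledException)
{
    Console.WriteLine("cancellation propagated");
}

Console.WriteLine(await SendOrBufferAsync(_ => throw new TimeoutException(), CancellationToken.None));
```

Because the filter returns `false` for caller-initiated cancellation, the exception bypasses the buffering branch and reaches the caller unchanged.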
**Resolution**
_Unresolved._
### NotificationService-003 — Error classification by substring matching on exception messages is fragile
| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:144-147`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:163-166` |
**Description**
Transient/permanent classification depends on `ex.Message.Contains("5.")`, `Contains("4.")`, `Contains("550")`, `Contains("421")`, etc. This is unreliable: (a) `Message.Contains("5.")` matches any message containing the literal "5." anywhere — e.g. a host name `smtp5.example.com`, a version string, or a path — producing false permanent classification; (b) `Contains("4.")` likewise matches `"v4.0"` or an IP address octet; (c) MailKit exposes the actual SMTP status code on `SmtpCommandException.StatusCode`, which is the correct, locale-independent source of truth and is being ignored; (d) message text is culture/version-dependent and not part of any stable contract. Misclassification has real consequences: a permanent failure misread as transient floods the S&F buffer (which the design doc explicitly says must be prevented), and a transient failure misread as permanent loses the notification.
**Recommendation**
Classify on MailKit's typed exceptions and `SmtpCommandException.StatusCode` (4xx → transient, 5xx → permanent), and `SocketException`/`SmtpProtocolException`/connection-refused → transient. Remove all `Message.Contains` checks.
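As a rough illustration, classification by numeric reply code needs no string inspection at all. The sketch assumes the code has already been extracted as an `int` (with MailKit, presumably from `SmtpCommandException.StatusCode`; how that value is converted is not shown here), and `SmtpFailureKind` is a hypothetical result type.

```csharp
using System;

// 4yz replies are transient negative completions, 5yz replies are permanent ones.
static SmtpFailureKind Classify(int replyCode) => replyCode switch
{
    >= 400 and <= 499 => SmtpFailureKind.Transient,
    >= 500 and <= 599 => SmtpFailureKind.Permanent,
    _ => SmtpFailureKind.Unknown
};

Console.WriteLine(Classify(421)); // Transient
Console.WriteLine(Classify(550)); // Permanent

// Hypothetical result type for this sketch.
public enum SmtpFailureKind { Transient, Permanent, Unknown }
```

A host name such as `smtp5.example.com` in the exception text has no effect on this classification, which is the failure mode the substring checks are exposed to.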
**Resolution**
_Unresolved._
### NotificationService-004 — `DeliverAsync` constructs two SMTP clients and leaks the used one
| | |
|--|--|
| Severity | High |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:118-119` |
**Description**
```csharp
using var client = _smtpClientFactory() as IDisposable;
var smtp = _smtpClientFactory();
```
The factory is invoked twice, creating two separate `MailKitSmtpClientWrapper` instances (each owning a real `SmtpClient` with a socket). The first instance is assigned to `client` and disposed by the `using`, but it is never used. The second instance, `smtp`, is the one that actually connects, authenticates, sends, and disconnects, but it is never disposed. `MailKitSmtpClientWrapper` implements `IDisposable` and wraps an unmanaged socket, so the connected client is leaked on every send: `DisconnectAsync` closes the connection but does not dispose the `SmtpClient`. Over time this leaks sockets/handles.
**Recommendation**
Create exactly one client and dispose the one that is actually used, e.g. `using var smtp = _smtpClientFactory();`.
That declaration compiles only if `ISmtpClientWrapper` is itself disposable, so have the interface extend `IDisposable` (or `IAsyncDisposable`); until then, cast the single instance to `IDisposable` and dispose it explicitly.
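A sketch of the single-client shape, assuming `ISmtpClientWrapper` is changed to extend `IDisposable`. The simplified method signatures and `FakeSmtpClient` are hypothetical and exist only to make the example runnable.

```csharp
using System;
using System.Threading.Tasks;

// One factory call, one client; the using declaration disposes it on every exit path.
static async Task DeliverAsync(Func<ISmtpClientWrapper> factory)
{
    using var smtp = factory();
    await smtp.ConnectAsync();
    try
    {
        await smtp.SendAsync("hello");
    }
    finally
    {
        // Runs on success and on failure, so the connection is always torn down.
        await smtp.DisconnectAsync();
    }
}

var fake = new FakeSmtpClient();
await DeliverAsync(() => fake);
Console.WriteLine($"disconnected={fake.Disconnected}, disposed={fake.Disposed}");

// Simplified, hypothetical version of the wrapper interface.
public interface ISmtpClientWrapper : IDisposable
{
    Task ConnectAsync();
    Task SendAsync(string body);
    Task DisconnectAsync();
}

public sealed class FakeSmtpClient : ISmtpClientWrapper
{
    public bool Disconnected { get; private set; }
    public bool Disposed { get; private set; }
    public Task ConnectAsync() => Task.CompletedTask;
    public Task SendAsync(string body) => Task.CompletedTask;
    public Task DisconnectAsync() { Disconnected = true; return Task.CompletedTask; }
    public void Dispose() => Disposed = true;
}
```

The same structure also addresses the disconnect-on-failure gap, since teardown no longer depends on the send succeeding.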
**Resolution**
_Unresolved._
### NotificationService-005 — Non-TLS path uses `SecureSocketOptions.Auto`, contradicting the requested mode
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:18`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:123` |
**Description**
`ConnectAsync` maps `useTls` to either `SecureSocketOptions.StartTls` or `SecureSocketOptions.Auto`. `useTls` is computed in `DeliverAsync` as `TlsMode == "starttls"`. So a configuration of `TlsMode = "none"` produces `useTls = false` → `SecureSocketOptions.Auto`, which lets MailKit opportunistically negotiate TLS: the opposite of "None". Worse, the design doc defines three TLS modes (`None`, `StartTLS`, `SSL`), but the code collapses them to a single boolean, so `SSL` (implicit TLS, typically port 465) is treated identically to `None`/`Auto` and the SSL mode is effectively unsupported. The `bool useTls` parameter cannot represent the three-state requirement.
**Recommendation**
Pass the `TlsMode` string (or a `TlsMode` enum) through to the wrapper and map explicitly: `None` → `SecureSocketOptions.None`, `StartTLS` → `SecureSocketOptions.StartTls`, `SSL` → `SecureSocketOptions.SslOnConnect`. Validate the configured value and reject unknown modes.
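The explicit three-way mapping could be sketched as below. To keep the example free of package dependencies, `SocketSecurity` is a local stand-in whose members mirror the `SecureSocketOptions` values named above; `MapTlsMode` is a hypothetical helper, and case-insensitive matching is an assumption.

```csharp
using System;

// Maps the configured mode explicitly and rejects anything unrecognised.
static SocketSecurity MapTlsMode(string tlsMode) => tlsMode.Trim().ToLowerInvariant() switch
{
    "none" => SocketSecurity.None,
    "starttls" => SocketSecurity.StartTls,
    "ssl" => SocketSecurity.SslOnConnect,
    _ => throw new ArgumentException($"Unknown TLS mode '{tlsMode}'.", nameof(tlsMode))
};

Console.WriteLine(MapTlsMode("None"));     // None
Console.WriteLine(MapTlsMode("StartTLS")); // StartTls
Console.WriteLine(MapTlsMode("SSL"));      // SslOnConnect

// Stand-in for MailKit's SecureSocketOptions (only the three members used here).
public enum SocketSecurity { None, StartTls, SslOnConnect }
```

Throwing on unknown values surfaces a misconfigured `TlsMode` at connect time instead of silently falling back to opportunistic negotiation.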
**Resolution**
_Unresolved._
### NotificationService-006 — OAuth2 token cache is keyed to nothing; wrong token returned when multiple SMTP configs exist
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/OAuth2TokenService.cs:14-15`, `src/ScadaLink.NotificationService/OAuth2TokenService.cs:30-35` |
**Description**
`OAuth2TokenService` is registered as a singleton and stores a single `_cachedToken`/`_tokenExpiry` pair. `GetTokenAsync` ignores the `credentials` argument when deciding whether the cache is valid — it only checks expiry. If two SMTP configurations with different tenant/client credentials are ever used (the repository's `GetAllSmtpConfigurationsAsync` returns a list, implying multiple configs are possible), the second caller receives the first caller's token, which will fail authentication against the second tenant. Even with a single config today this is a latent correctness bug and makes the service's behaviour depend on call order.
**Recommendation**
Key the cache by the credential identity (e.g. a dictionary keyed by `tenantId:clientId`, or by a hash of the credential string), or document and enforce the single-SMTP-config invariant. Given the design doc says one SMTP config is deployed per site, enforcing the invariant is acceptable but should be explicit.
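If the cache is keyed rather than the invariant enforced, a sketch might look like the following. `KeyedTokenCache` and its `fetch` delegate are hypothetical; the real service's token request and refresh margin are not shown in this review.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

var fetches = 0;
var cache = new KeyedTokenCache((tenantId, clientId) =>
{
    fetches++; // counts calls to the (fake) token endpoint
    return Task.FromResult(($"token-{tenantId}-{clientId}", TimeSpan.FromMinutes(60)));
});

var a1 = await cache.GetTokenAsync("tenant-a", "client-1");
var b1 = await cache.GetTokenAsync("tenant-b", "client-2");
var a2 = await cache.GetTokenAsync("tenant-a", "client-1"); // served from the cache
Console.WriteLine($"{a1} / {b1} / fetches={fetches}");

public sealed class KeyedTokenCache
{
    // One entry per credential identity instead of a single shared token.
    private readonly ConcurrentDictionary<(string TenantId, string ClientId), (string Token, DateTimeOffset Expiry)> _cache = new();
    private readonly Func<string, string, Task<(string Token, TimeSpan Lifetime)>> _fetch;

    public KeyedTokenCache(Func<string, string, Task<(string Token, TimeSpan Lifetime)>> fetch) => _fetch = fetch;

    public async Task<string> GetTokenAsync(string tenantId, string clientId)
    {
        var key = (tenantId, clientId);
        if (_cache.TryGetValue(key, out var entry) && entry.Expiry > DateTimeOffset.UtcNow)
            return entry.Token;

        // Concurrent callers may both fetch here; the last write wins, which is harmless.
        var (token, lifetime) = await _fetch(tenantId, clientId);
        _cache[key] = (token, DateTimeOffset.UtcNow + lifetime);
        return token;
    }
}
```

A tuple key is used instead of a joined string so that identifiers containing a separator character cannot collide.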
**Resolution**
_Unresolved._
### NotificationService-007 — Connection timeout and max-concurrent-connections from the design doc are not implemented
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationOptions.cs:11-14`, `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:16-20`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:111-140` |
**Description**
The design doc specifies an SMTP "Connection timeout (default 30s)" and "Max concurrent connections (default 5)", and `NotificationOptions`/`SmtpConfiguration` both carry these fields. Neither is enforced: `MailKitSmtpClientWrapper.ConnectAsync` never sets `SmtpClient.Timeout`, so the connection relies on MailKit's default timeout rather than the configured value (only the caller's `CancellationToken` bounds it, and callers may pass `default`). There is no semaphore or other throttle limiting concurrent SMTP connections per site, so `MaxConcurrentConnections` has no effect. Both options exist but are dead configuration.
**Recommendation**
Set `SmtpClient.Timeout` from `ConnectionTimeoutSeconds` in `ConnectAsync` (and/or derive a linked `CancellationTokenSource`). Introduce a `SemaphoreSlim(MaxConcurrentConnections)` gating `DeliverAsync`. If these limits are intentionally deferred, mark the options `[Obsolete]`/document them as not-yet-enforced and note the gap in the design doc.
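The concurrency gate can be sketched with a `SemaphoreSlim`. In this example `maxConcurrentConnections` stands in for the configured `MaxConcurrentConnections`, `DeliverThrottledAsync` is a hypothetical wrapper, and `Task.Delay` stands in for the real connect and send.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

const int maxConcurrentConnections = 2; // stand-in for the configured limit
using var gate = new SemaphoreSlim(maxConcurrentConnections);
var sync = new object();
int inFlight = 0, peak = 0;

async Task DeliverThrottledAsync()
{
    await gate.WaitAsync(); // waits while the limit is reached
    try
    {
        lock (sync) { inFlight++; peak = Math.Max(peak, inFlight); }
        await Task.Delay(20); // stands in for connect + authenticate + send
    }
    finally
    {
        lock (sync) { inFlight--; }
        gate.Release();
    }
}

// Six sends are started at once, but at most two run concurrently.
await Task.WhenAll(Enumerable.Range(0, 6).Select(_ => DeliverThrottledAsync()));
Console.WriteLine($"peak concurrent sends: {peak} (limit {maxConcurrentConnections})");
```

In the real service the semaphore would be a field on the delivery service (or per site), created once from the options value.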
**Resolution**
_Unresolved._
### NotificationService-008 — Recipient email addresses are not validated before send
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:136-137`, `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:50-53` |
**Description**
`SendAsync` builds `bccAddresses` directly from `recipient.EmailAddress` and passes them to `MailboxAddress.Parse`. If any recipient row has a malformed address, `MailboxAddress.Parse` throws `ParseException`. `ParseException` is not a `TimeoutException`/`SocketException`/`IOException` and its message will not generally contain "4." or "5.", so it falls through `DeliverAsync`'s outer `catch ... when (... && !IsTransientSmtpError(ex))` filter, which re-throws it (`:153`); it then escapes `SendAsync` entirely as an unhandled exception (the `SendAsync` catch blocks only cover `SmtpPermanentException` and transient errors). A single bad address in a list therefore crashes the send with an exception type the calling script is not told to expect, instead of producing a clean `NotificationResult` error. The same applies to a malformed `FromAddress`.
**Recommendation**
Validate addresses up front (e.g. `MailboxAddress.TryParse`) and return a `NotificationResult(false, ...)` listing invalid recipients, or wrap `DeliverAsync` so any non-classified exception becomes a permanent `NotificationResult` failure rather than escaping. Consider validating addresses at definition time in the Central UI as well.
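Up-front validation might look like the sketch below. To stay dependency-free it uses the BCL's `System.Net.Mail.MailAddress.TryCreate` as a stand-in for `MailboxAddress.TryParse`; the two parsers do not accept exactly the same inputs, and `FindInvalid` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Mail;

// Returns the addresses that fail to parse, so they can be reported in the result.
static IReadOnlyList<string> FindInvalid(IEnumerable<string> addresses) =>
    addresses.Where(a => !MailAddress.TryCreate(a, out _)).ToList();

var recipients = new[] { "ops@example.com", "not-an-address" };
var invalid = FindInvalid(recipients);

if (invalid.Count > 0)
{
    // The real service would return a failed NotificationResult here instead of throwing.
    Console.WriteLine($"Invalid recipient(s): {string.Join(", ", invalid)}");
}
```

Collecting every invalid address before failing lets the caller see the full list in one result rather than fixing addresses one exception at a time.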
**Resolution**
_Unresolved._
### NotificationService-009 — Credentials handled as plaintext strings; OAuth2 client secret logged risk
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:127-134`, `src/ScadaLink.NotificationService/OAuth2TokenService.cs:30-65`, `src/ScadaLink.Commons/Entities/Notifications/SmtpConfiguration.cs:9` |
**Description**
SMTP credentials (Basic Auth `user:pass` and OAuth2 `tenantId:clientId:clientSecret`) are stored and passed as a single colon-delimited plaintext `string` (`SmtpConfiguration.Credentials`). There is no indication the value is encrypted at rest in SQLite or in the central config DB. The colon-delimited packing is also brittle. The count-limited splits (`Split(':', 2)` / `Split(':', 3)`) leave a `:` in the final field (password or client secret) intact, but a `:` in any earlier field (user name, tenant ID, client ID) shifts every boundary after it and silently corrupts the parsed credentials. Separately, while the current code does not log the secret directly, the substring-based error classification logs full exception messages (`_logger.LogWarning(ex, ...)`, `LogError(ex, ...)`) and MailKit exceptions can echo back server responses; an authentication failure message could surface credential fragments into logs. There is no defensive scrubbing.
**Recommendation**
Store credentials encrypted at rest (DPAPI/Data Protection or a secret store) and model them as structured fields rather than a colon-packed string, so that a `:` in any field is handled safely. Ensure credential values are never written to logs; consider a redaction step on exception messages before logging.
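How a colon-packed credential string behaves depends on which field contains the extra `:`. The sketch below demonstrates this with `string.Split(char, int)` and then shows a structured alternative; the sample values and the `OAuth2Credentials` record are hypothetical.

```csharp
using System;

// Count-limited Split keeps the remainder in the last element,
// so a colon in the final field survives:
var basic = "user:pa:ss".Split(':', 2);
Console.WriteLine(string.Join(" | ", basic)); // user | pa:ss

// A colon in an earlier field shifts every boundary after it:
var oauth = "ten:ant:client:secret".Split(':', 3);
Console.WriteLine(string.Join(" | ", oauth)); // ten | ant | client:secret

// Structured fields need no delimiter, so no value can be mis-split.
var structured = new OAuth2Credentials("ten:ant", "client", "secret");
Console.WriteLine(structured.TenantId); // ten:ant

// Hypothetical structured credential model.
public sealed record OAuth2Credentials(string TenantId, string ClientId, string ClientSecret);
```

Encryption at rest is a separate concern from the field layout and is not covered by this sketch.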
**Resolution**
_Unresolved._
### NotificationService-010 — `DeliverAsync` does not disconnect the SMTP client on failure
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:121-154` |
**Description**
`DisconnectAsync` is only called at `:139`, on the success path inside the `try` block. If `AuthenticateAsync` or `SendAsync` throws, control jumps to the `catch` filter at `:141` and the method exits (re-throwing or wrapping) without ever calling `DisconnectAsync`. Combined with NS-004 (the client is never disposed either), a failed send leaves an open, authenticated SMTP connection until the socket is eventually reclaimed by finalization. Under sustained transient failures this can exhaust the SMTP server's connection slots.
**Recommendation**
Move disconnect/dispose into a `finally` block (or use `await using` once `ISmtpClientWrapper` supports `IAsyncDisposable`) so the connection is always torn down regardless of outcome.
**Resolution**
_Unresolved._
### NotificationService-011 — `SmtpPermanentException` declared in the wrong file; module conventions
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:173-177`, `src/ScadaLink.Commons/Entities/Notifications/SmtpConfiguration.cs:5-15` |
**Description**
Two minor convention issues. (1) `SmtpPermanentException` is a public exception type declared at the bottom of `NotificationDeliveryService.cs` rather than in its own file (`SmtpPermanentException.cs`), which is inconsistent with the one-type-per-file layout used elsewhere and makes it harder to locate. (2) `SmtpConfiguration` (a Commons POCO) declares non-nullable `string` properties (`Host`, `AuthType`, `FromAddress`) that are only guaranteed by the constructor; EF Core materialization or object-initializer use can leave them null while the type system says otherwise. These are persistence-ignorant POCO concerns but worth flagging because the delivery service dereferences `config.Host`, `config.AuthType`, `config.FromAddress` without null checks.
**Recommendation**
Move `SmtpPermanentException` to its own file. For `SmtpConfiguration`, either keep the constructor as the only path and document it, or use `required` members so the compiler enforces initialization.
**Resolution**
_Unresolved._
### NotificationService-012 — Test coverage gaps: OAuth2 delivery path, permanent-classification fallback, token-cache concurrency
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.NotificationService.Tests/NotificationDeliveryServiceTests.cs`, `tests/ScadaLink.NotificationService.Tests/OAuth2TokenServiceTests.cs` |
**Description**
The tests cover the happy path, list-not-found, no-recipients, no-SMTP-config, permanent failure, transient-without-S&F, and transient-with-S&F buffering. Notable untested paths: (1) the OAuth2 delivery branch in `DeliverAsync:128-132` — every test uses `tokenService: null` and Basic Auth, so OAuth2 token resolution during a send is never exercised; (2) `DeliverAsync`'s permanent-classification fallback (`:144-149`) that promotes a generic exception whose message contains "550"/"553"/"554" to `SmtpPermanentException` is never tested; (3) `OAuth2TokenServiceTests` never tests concurrent `GetTokenAsync` calls (the double-checked-locking path) or token expiry/refresh — the cache test uses a 3600s token so refresh never triggers; (4) no test covers the transient-with-S&F path actually delivering after retry (which would also have caught NS-001). Given NS-001 is a critical defect, the absence of an end-to-end buffer-and-retry test is significant.
**Recommendation**
Add tests for: OAuth2-authenticated send with a mocked `OAuth2TokenService`; the `DeliverAsync` 5xx-message permanent fallback; token expiry/refresh (short `expires_in`); concurrent token acquisition; and an end-to-end buffered-notification retry once a `Notification` S&F handler is registered.
**Resolution**
_Unresolved._

330
code-reviews/README.md Normal file

@@ -0,0 +1,330 @@
# Code Reviews
Comprehensive, per-module code reviews of the ScadaLink codebase. Each module (one
buildable project under `src/`) has its own folder containing a `findings.md`. This
README is the aggregated index — the single place to see all outstanding work.
## How it works
- Reviews are performed one module at a time against a fixed checklist.
- Every finding is recorded in the module's `findings.md` with a severity and status.
- Findings are **never deleted** — they are closed by changing their status, keeping
a full audit trail.
- This README aggregates every **pending** finding (`Open` / `In Progress`) across all
modules.
See **[REVIEW-PROCESS.md](REVIEW-PROCESS.md)** for the full procedure: the review
checklist, severity definitions, finding format, and how to mark items resolved.
## Layout
```
code-reviews/
├── README.md # this file — process overview + pending findings
├── REVIEW-PROCESS.md # how to perform a review and track findings
├── _template/findings.md # copy-this template for a module review
└── <Module>/findings.md # one folder per src/ project
```
## Baseline review — 2026-05-16
All 19 modules were reviewed at commit `9c60592`. This established the baseline below.
| Severity | Open findings |
|----------|---------------|
| Critical | 6 |
| High | 46 |
| Medium | 100 |
| Low | 89 |
| **Total** | **241** |
## Module Status
| Module | Review status | Last reviewed | Commit | Open (C/H/M/L) | Total |
|--------|---------------|---------------|--------|----------------|-------|
| [CentralUI](CentralUI/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 1/3/10/5 | 19 |
| [CLI](CLI/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/1/6/6 | 13 |
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/1/4/3 | 8 |
| [Commons](Commons/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/0/4/8 | 12 |
| [Communication](Communication/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 1/2/5/3 | 11 |
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/1/4/6 | 11 |
| [DataConnectionLayer](DataConnectionLayer/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 1/4/6/2 | 13 |
| [DeploymentManager](DeploymentManager/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/3/6/5 | 14 |
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 1/2/7/4 | 14 |
| [HealthMonitoring](HealthMonitoring/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/2/5/5 | 12 |
| [Host](Host/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/1/3/7 | 11 |
| [InboundAPI](InboundAPI/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/3/5/5 | 13 |
| [ManagementService](ManagementService/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/3/5/5 | 13 |
| [NotificationService](NotificationService/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 1/3/5/3 | 12 |
| [Security](Security/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/3/4/4 | 11 |
| [SiteEventLogging](SiteEventLogging/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/4/4/3 | 11 |
| [SiteRuntime](SiteRuntime/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/3/8/5 | 16 |
| [StoreAndForward](StoreAndForward/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 1/2/4/6 | 13 |
| [TemplateEngine](TemplateEngine/findings.md) | Reviewed | 2026-05-16 | `9c60592` | 0/5/5/4 | 14 |
## Pending Findings
All findings are currently `Open`. As findings are resolved, remove them from the
tables below (see [REVIEW-PROCESS.md](REVIEW-PROCESS.md) §5). Full detail for each
finding — description, location, recommendation — lives in the module's `findings.md`.
### Critical (6)
| ID | Module | Title |
|----|--------|-------|
| CentralUI-001 | [CentralUI](CentralUI/findings.md) | Test Run sandbox executes arbitrary C# with no trust-model enforcement |
| Communication-001 | [Communication](Communication/findings.md) | Snapshot timeout leaves orphaned bridge actor and site subscription |
| DataConnectionLayer-001 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `Task.Run` in `HandleSubscribe` mutates actor state off the actor thread |
| ExternalSystemGateway-001 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | No S&F delivery handler registered; cached calls and writes can never be delivered |
| NotificationService-001 | [NotificationService](NotificationService/findings.md) | Buffered notifications are never retried (no S&F delivery handler) |
| StoreAndForward-001 | [StoreAndForward](StoreAndForward/findings.md) | Replication to standby is never triggered by the active node |
### High (46)
| ID | Module | Title |
|----|--------|-------|
| CLI-001 | [CLI](CLI/findings.md) | `SCADALINK_FORMAT` env var and config-file format are dead; format precedence broken |
| CentralUI-002 | [CentralUI](CentralUI/findings.md) | Site-scoped Deployment permissions are issued but never enforced |
| CentralUI-003 | [CentralUI](CentralUI/findings.md) | `Console.SetOut`/`SetError` mutates process-global state across concurrent circuits |
| CentralUI-004 | [CentralUI](CentralUI/findings.md) | `CookieAuthenticationStateProvider` reads `HttpContext` for the life of the circuit |
| ClusterInfrastructure-001 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | Module implements none of its documented responsibilities |
| Communication-002 | [Communication](Communication/findings.md) | gRPC reconnect does not unsubscribe the previous stream, leaking site-side relay actors |
| Communication-003 | [Communication](Communication/findings.md) | SiteStreamGrpcClient subscription map overwritten without disposal; reconnect can cancel the wrong stream |
| ConfigurationDatabase-001 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `GetTemplateWithChildrenAsync` loads child templates then discards them |
| DataConnectionLayer-002 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `Restart` supervision discards all subscription state on connection-actor crash |
| DataConnectionLayer-003 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `RealOpcUaClient` callback/monitored-item dictionaries mutated without synchronization |
| DataConnectionLayer-004 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Subscribe-time tag-resolution failure leaves the connection healthy but never recovers correctly |
| DataConnectionLayer-005 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `WriteTimeout` option is documented and configured but never applied |
| DeploymentManager-001 | [DeploymentManager](DeploymentManager/findings.md) | Unexpected exceptions leave the deployment record stuck in `InProgress` |
| DeploymentManager-002 | [DeploymentManager](DeploymentManager/findings.md) | Failure-status write uses a possibly-cancelled cancellation token |
| DeploymentManager-006 | [DeploymentManager](DeploymentManager/findings.md) | Query-the-site-before-redeploy idempotency requirement not implemented |
| ExternalSystemGateway-002 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | Per-system call timeout is never applied to HTTP requests |
| ExternalSystemGateway-003 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `CachedCall` double-dispatches the HTTP request |
| HealthMonitoring-001 | [HealthMonitoring](HealthMonitoring/findings.md) | Store-and-forward buffer depth metric is never populated |
| HealthMonitoring-002 | [HealthMonitoring](HealthMonitoring/findings.md) | `SiteHealthState` mutable fields written from multiple threads without synchronization |
| Host-001 | [Host](Host/findings.md) | `/health/ready` includes the leader-only `active-node` check |
| InboundAPI-001 | [InboundAPI](InboundAPI/findings.md) | Singleton script handler cache mutated without synchronization |
| InboundAPI-003 | [InboundAPI](InboundAPI/findings.md) | API key compared with non-constant-time string equality |
| InboundAPI-005 | [InboundAPI](InboundAPI/findings.md) | Compiled API scripts run with no script-trust-model enforcement |
| ManagementService-001 | [ManagementService](ManagementService/findings.md) | Remote-query and debug-snapshot handlers bypass site-scope enforcement |
| ManagementService-002 | [ManagementService](ManagementService/findings.md) | Single-entity query handlers leak data across site scope |
| ManagementService-003 | [ManagementService](ManagementService/findings.md) | DebugStreamHub.SubscribeInstance performs no per-instance authorization |
| NotificationService-002 | [NotificationService](NotificationService/findings.md) | `TimeoutException`/`OperationCanceledException` misclassified as transient |
| NotificationService-003 | [NotificationService](NotificationService/findings.md) | Error classification by substring matching on exception messages is fragile |
| NotificationService-004 | [NotificationService](NotificationService/findings.md) | `DeliverAsync` constructs two SMTP clients and leaks the used one |
| Security-001 | [Security](Security/findings.md) | StartTLS upgrade path is unreachable dead code |
| Security-002 | [Security](Security/findings.md) | Authentication cookie is not marked `Secure` |
| Security-003 | [Security](Security/findings.md) | JWT signing key length is never validated |
| SiteEventLogging-001 | [SiteEventLogging](SiteEventLogging/findings.md) | `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space |
| SiteEventLogging-002 | [SiteEventLogging](SiteEventLogging/findings.md) | Storage-cap purge deletes the entire table when space is not reclaimed |
| SiteEventLogging-003 | [SiteEventLogging](SiteEventLogging/findings.md) | Shared `SqliteConnection` used by purge and query without the write lock |
| SiteEventLogging-004 | [SiteEventLogging](SiteEventLogging/findings.md) | Event-log handler runs as a cluster singleton that can land on the standby node |
| SiteRuntime-001 | [SiteRuntime](SiteRuntime/findings.md) | `Instance.SetAttribute` never writes to the Data Connection Layer |
| SiteRuntime-002 | [SiteRuntime](SiteRuntime/findings.md) | `RouteInboundApiSetAttributes` always treats writes as static overrides |
| SiteRuntime-003 | [SiteRuntime](SiteRuntime/findings.md) | Redeployment relies on a fixed 500 ms reschedule and can collide on the child actor name |
| StoreAndForward-002 | [StoreAndForward](StoreAndForward/findings.md) | Messages enqueued with no registered handler are buffered but never deliverable |
| StoreAndForward-003 | [StoreAndForward](StoreAndForward/findings.md) | Off-by-one in retry accounting: immediate failure pre-counts as retry 1 |
| TemplateEngine-001 | [TemplateEngine](TemplateEngine/findings.md) | Deeply nested composed members are dropped during flattening |
| TemplateEngine-002 | [TemplateEngine](TemplateEngine/findings.md) | Derived templates omit all base alarms; composed alarms cannot be overridden per slot |
| TemplateEngine-003 | [TemplateEngine](TemplateEngine/findings.md) | `UpdateAttributeAsync` lets a non-locked attribute change its fixed DataType / DataSourceReference |
| TemplateEngine-004 | [TemplateEngine](TemplateEngine/findings.md) | Alarm on-trigger script references are never resolved (empty placeholder) |
| TemplateEngine-005 | [TemplateEngine](TemplateEngine/findings.md) | Collision validation is skipped when creating a child template |
### Medium (100)
| ID | Module | Title |
|----|--------|-------|
| CLI-002 | [CLI](CLI/findings.md) | Empty success body crashes table rendering with an unhandled exception |
| CLI-003 | [CLI](CLI/findings.md) | Non-JSON success body crashes table rendering |
| CLI-004 | [CLI](CLI/findings.md) | Malformed `--url` throws an unhandled `UriFormatException` |
| CLI-005 | [CLI](CLI/findings.md) | Malformed `--bindings` / `--overrides` JSON throws unhandled exceptions |
| CLI-006 | [CLI](CLI/findings.md) | Password is passed as a command-line argument with no safer alternative |
| CLI-007 | [CLI](CLI/findings.md) | `Component-CLI.md` command surface is substantially stale |
| CentralUI-005 | [CentralUI](CentralUI/findings.md) | Session expiry implementation diverges from the documented policy |
| CentralUI-006 | [CentralUI](CentralUI/findings.md) | Deployment status page polls every 10s despite the documented SignalR-push design |
| CentralUI-007 | [CentralUI](CentralUI/findings.md) | Monitoring nav links to Deployment-only pages are shown to all roles |
| CentralUI-008 | [CentralUI](CentralUI/findings.md) | Audit-log date filters treat browser-local datetimes as UTC |
| CentralUI-009 | [CentralUI](CentralUI/findings.md) | `DebugView` stream callbacks touch a possibly-disposed `ToastNotification` |
| CentralUI-010 | [CentralUI](CentralUI/findings.md) | `ToastNotification` auto-dismiss continuation runs after component disposal |
| CentralUI-011 | [CentralUI](CentralUI/findings.md) | `DiffDialog` leaves a dangling `TaskCompletionSource` when disposed while open |
| CentralUI-012 | [CentralUI](CentralUI/findings.md) | N+1 query loading data connections for the Sites page |
| CentralUI-013 | [CentralUI](CentralUI/findings.md) | `ScriptAnalysisService` blocks on async shared-script lookups |
| CentralUI-014 | [CentralUI](CentralUI/findings.md) | Test Run side effects (HTTP/SQL/SMTP) fire against production services |
| ClusterInfrastructure-002 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | No-op DI extension methods report success while doing nothing |
| ClusterInfrastructure-003 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | ClusterOptions omits several documented node-configuration settings |
| ClusterInfrastructure-004 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | ClusterOptions has no validation despite safety-critical values |
| ClusterInfrastructure-006 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | No tests for any cluster behaviour; only the options POCO is covered |
| Commons-001 | [Commons](Commons/findings.md) | `StaleTagMonitor` stale-fire race between timer and `OnValueReceived` |
| Commons-002 | [Commons](Commons/findings.md) | `DynamicJsonElement` retains a `JsonElement` whose `JsonDocument` lifetime it does not own |
| Commons-003 | [Commons](Commons/findings.md) | `ScriptParameters.GetNullable` silently swallows conversion failures |
| Commons-004 | [Commons](Commons/findings.md) | `ManagementCommandRegistry` name mapping is asymmetric and namespace-scoped |
| Communication-004 | [Communication](Communication/findings.md) | Coordinator actors declare no SupervisorStrategy (design requires Resume) |
| Communication-005 | [Communication](Communication/findings.md) | gRPC keepalive and max-stream-lifetime options are defined but never applied |
| Communication-006 | [Communication](Communication/findings.md) | Site address load failures are silently swallowed, leaving a stale cache |
| Communication-007 | [Communication](Communication/findings.md) | `SiteStreamGrpcClientFactory.Dispose` blocks on async work (sync-over-async) |
| Communication-008 | [Communication](Communication/findings.md) | Reconnect retry-count reset can mask a flapping stream indefinitely |
| ConfigurationDatabase-002 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Hardcoded `sa` connection string with embedded password literal |
| ConfigurationDatabase-003 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | No-arg `AddConfigurationDatabase()` silently registers nothing |
| ConfigurationDatabase-004 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Secret-bearing columns stored in plaintext with no protection |
| ConfigurationDatabase-007 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `AuditService` does not handle JSON-serialization failure of arbitrary `afterState` |
| DataConnectionLayer-006 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Health quality counters not reset/recomputed after failover or re-subscribe |
| DataConnectionLayer-007 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `ReadBatchAsync` aborts the whole batch on the first failing tag |
| DataConnectionLayer-009 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Implemented failover heuristic diverges from the documented state machine |
| DataConnectionLayer-010 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Tag-resolution retry can issue duplicate concurrent subscribe attempts |
| DataConnectionLayer-011 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Stale subscription callbacks from disposed adapters can still reach the actor |
| DataConnectionLayer-012 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `AutoAcceptUntrustedCerts` defaults to `true`, accepting any server certificate |
| DeploymentManager-003 | [DeploymentManager](DeploymentManager/findings.md) | Successful-deployment cleanup is not atomic with the status write |
| DeploymentManager-004 | [DeploymentManager](DeploymentManager/findings.md) | Site-success but central-delete-failure leaves orphaned site config |
| DeploymentManager-005 | [DeploymentManager](DeploymentManager/findings.md) | `OperationLockManager` leaks a `SemaphoreSlim` per instance name |
| DeploymentManager-007 | [DeploymentManager](DeploymentManager/findings.md) | "Diff View" reduced to a hash comparison with no diff detail |
| DeploymentManager-008 | [DeploymentManager](DeploymentManager/findings.md) | `DeploymentManagerOptions` is never bound to configuration |
| DeploymentManager-011 | [DeploymentManager](DeploymentManager/findings.md) | Tests never exercise a successful deployment or lifecycle success path |
| ExternalSystemGateway-004 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | System retry settings are not honoured for cached calls/writes |
| ExternalSystemGateway-005 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `HttpRequestMessage` and `HttpResponseMessage` are not disposed |
| ExternalSystemGateway-006 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `BuildUrl` ignores path templates and appends a trailing slash for empty paths |
| ExternalSystemGateway-007 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | External error response bodies are echoed verbatim into script-visible error messages |
| ExternalSystemGateway-008 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | Cancellation is conflated with transient timeout failure |
| ExternalSystemGateway-009 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `StoreAndForwardResult` from `EnqueueAsync` is discarded; permanent failures during buffering are swallowed |
| ExternalSystemGateway-010 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `GetConnectionAsync` leaks the `SqlConnection` when `OpenAsync` fails |
| HealthMonitoring-003 | [HealthMonitoring](HealthMonitoring/findings.md) | Shared state mutated inside `ConcurrentDictionary.AddOrUpdate` update delegate |
| HealthMonitoring-005 | [HealthMonitoring](HealthMonitoring/findings.md) | Central self-report site can flap offline; no heartbeat grace like real sites |
| HealthMonitoring-007 | [HealthMonitoring](HealthMonitoring/findings.md) | Heartbeats for not-yet-registered sites are silently dropped |
| HealthMonitoring-008 | [HealthMonitoring](HealthMonitoring/findings.md) | `GetAllSiteStates` / `GetSiteState` leak live mutable state objects to callers |
| HealthMonitoring-009 | [HealthMonitoring](HealthMonitoring/findings.md) | Missing test coverage for central report loop, heartbeat path, replication, and collector setters |
| Host-002 | [Host](Host/findings.md) | Akka.Persistence required by REQ-HOST-6 is not configured and not used |
| Host-003 | [Host](Host/findings.md) | Secrets committed in plaintext in `appsettings.Central.json` |
| Host-004 | [Host](Host/findings.md) | Site seed-node list points at the gRPC port, not a remoting port |
| InboundAPI-002 | [InboundAPI](InboundAPI/findings.md) | Lazy compilation is a check-then-act race with no atomicity |
| InboundAPI-004 | [InboundAPI](InboundAPI/findings.md) | Client disconnect is misreported as a script timeout |
| InboundAPI-006 | [InboundAPI](InboundAPI/findings.md) | No request body size limit on the inbound endpoint |
| InboundAPI-007 | [InboundAPI](InboundAPI/findings.md) | `Database.Connection()` script API from the design doc is not implemented |
| InboundAPI-008 | [InboundAPI](InboundAPI/findings.md) | Inbound API endpoint not restricted to the active central node |
| ManagementService-004 | [ManagementService](ManagementService/findings.md) | Actor offloads work to Task.Run instead of using PipeTo |
| ManagementService-006 | [ManagementService](ManagementService/findings.md) | JsonDocument instances never disposed in the HTTP endpoint |
| ManagementService-007 | [ManagementService](ManagementService/findings.md) | Inconsistent and cycle-prone serialization of repository entities |
| ManagementService-009 | [ManagementService](ManagementService/findings.md) | Audit logging applied inconsistently across mutating handlers |
| ManagementService-013 | [ManagementService](ManagementService/findings.md) | No tests for site-scope enforcement, the HTTP endpoint, or DebugStreamHub |
| NotificationService-005 | [NotificationService](NotificationService/findings.md) | Non-TLS path uses `SecureSocketOptions.Auto`, contradicting the requested mode |
| NotificationService-006 | [NotificationService](NotificationService/findings.md) | OAuth2 token cache is keyed to nothing; wrong token returned when multiple SMTP configs exist |
| NotificationService-007 | [NotificationService](NotificationService/findings.md) | Connection timeout and max-concurrent-connections from the design doc are not implemented |
| NotificationService-008 | [NotificationService](NotificationService/findings.md) | Recipient email addresses are not validated before send |
| NotificationService-009 | [NotificationService](NotificationService/findings.md) | Credentials handled as plaintext strings; risk of the OAuth2 client secret being logged |
| Security-004 | [Security](Security/findings.md) | Search filter uses `uid=` while fallback DN construction uses `cn=` |
| Security-005 | [Security](Security/findings.md) | DN injection in the no-service-account bind fallback |
| Security-006 | [Security](Security/findings.md) | JWT validation disables issuer and audience checks |
| Security-007 | [Security](Security/findings.md) | Idle-timeout claim is reset on every token refresh |
| SiteEventLogging-005 | [SiteEventLogging](SiteEventLogging/findings.md) | `LogEventAsync` performs synchronous disk I/O on the caller's thread |
| SiteEventLogging-007 | [SiteEventLogging](SiteEventLogging/findings.md) | `ISiteEventLogger` consumers downcast to the concrete type and reach into the DB connection |
| SiteEventLogging-008 | [SiteEventLogging](SiteEventLogging/findings.md) | Event-recording write failures are silently swallowed |
| SiteEventLogging-010 | [SiteEventLogging](SiteEventLogging/findings.md) | Test coverage gaps: actor bridge, purge/write concurrency, vacuum effectiveness, query error path |
| SiteRuntime-004 | [SiteRuntime](SiteRuntime/findings.md) | `_totalDeployedCount` is incremented on redeployment of an existing instance |
| SiteRuntime-005 | [SiteRuntime](SiteRuntime/findings.md) | Deployment reports `Success` to central before persistence completes |
| SiteRuntime-006 | [SiteRuntime](SiteRuntime/findings.md) | Site-local repositories read `SiteStorageService` private field via reflection |
| SiteRuntime-007 | [SiteRuntime](SiteRuntime/findings.md) | Synthetic entity IDs use the non-deterministic `string.GetHashCode()` |
| SiteRuntime-008 | [SiteRuntime](SiteRuntime/findings.md) | Blocking `.GetAwaiter().GetResult()` on the actor thread during startup |
| SiteRuntime-009 | [SiteRuntime](SiteRuntime/findings.md) | Script execution actors run scripts on the default thread pool, not a dedicated dispatcher |
| SiteRuntime-010 | [SiteRuntime](SiteRuntime/findings.md) | `EnsureDclConnections` never updates a connection whose configuration changed |
| SiteRuntime-011 | [SiteRuntime](SiteRuntime/findings.md) | Trust-model validation is a substring scan and is both over- and under-inclusive |
| StoreAndForward-004 | [StoreAndForward](StoreAndForward/findings.md) | `RegisterDeliveryHandler` XML doc contradicts the implemented contract |
| StoreAndForward-005 | [StoreAndForward](StoreAndForward/findings.md) | Parked-message retry/discard can race with the in-progress retry sweep |
| StoreAndForward-010 | [StoreAndForward](StoreAndForward/findings.md) | Retry of a parked message does not reset `LastAttemptAt`, so its retry timing is unspecified |
| StoreAndForward-013 | [StoreAndForward](StoreAndForward/findings.md) | Critical paths lack test coverage: retry-due timing, replication-from-active, and the actor bridge |
| TemplateEngine-006 | [TemplateEngine](TemplateEngine/findings.md) | Forbidden-API enforcement is a naive substring scan (bypassable and false-positive prone) |
| TemplateEngine-007 | [TemplateEngine](TemplateEngine/findings.md) | Brace-balance "compilation" misjudges verbatim / interpolated / raw strings |
| TemplateEngine-008 | [TemplateEngine](TemplateEngine/findings.md) | `SetAlarmOverrideAsync` accepts overrides for unknown / composed alarms with no validation |
| TemplateEngine-009 | [TemplateEngine](TemplateEngine/findings.md) | N+1 query in `TemplateDeletionService.CanDeleteTemplateAsync` |
| TemplateEngine-010 | [TemplateEngine](TemplateEngine/findings.md) | `InstanceService` documents optimistic concurrency that is not implemented |
### Low (89)
| ID | Module | Title |
|----|--------|-------|
| CLI-008 | [CLI](CLI/findings.md) | `--format` value is not validated |
| CLI-009 | [CLI](CLI/findings.md) | Exit-code documentation does not match `HandleResponse` behaviour |
| CLI-010 | [CLI](CLI/findings.md) | `debug stream` reports Ctrl+C during connect as a connection failure |
| CLI-011 | [CLI](CLI/findings.md) | `CancellationTokenSource` in `debug stream` is never disposed |
| CLI-012 | [CLI](CLI/findings.md) | `debug stream` exit code is unreliable after stream termination |
| CLI-013 | [CLI](CLI/findings.md) | HTTP client, `debug stream`, and JSON-argument parsing are untested |
| CentralUI-015 | [CentralUI](CentralUI/findings.md) | `DialogService` continuations resolve off the render thread |
| CentralUI-016 | [CentralUI](CentralUI/findings.md) | Pagers render one button per page with no windowing |
| CentralUI-017 | [CentralUI](CentralUI/findings.md) | `/auth/logout` POST disables antiforgery, enabling logout CSRF |
| CentralUI-018 | [CentralUI](CentralUI/findings.md) | Broad `catch {}` blocks swallow JS interop and storage errors silently |
| CentralUI-019 | [CentralUI](CentralUI/findings.md) | Sparse unit-test coverage for a large module; critical paths untested |
| ClusterInfrastructure-005 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | No configuration section name constant for the Options pattern binding |
| ClusterInfrastructure-007 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | ClusterOptions lacks XML documentation comments |
| ClusterInfrastructure-008 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | "Phase 0 skeleton" status is undocumented at the module level |
| Commons-005 | [Commons](Commons/findings.md) | `OpcUaEndpointConfigSerializer.Deserialize` discards malformed legacy input and over-reports `IsLegacy` |
| Commons-006 | [Commons](Commons/findings.md) | `DynamicJsonElement.TryConvert` reports success for unconvertible target types |
| Commons-007 | [Commons](Commons/findings.md) | Several Commons types carry non-trivial logic, stretching REQ-COM-6 |
| Commons-008 | [Commons](Commons/findings.md) | `SetConnectionBindingsCommand` uses `ValueTuple` in a wire message contract |
| Commons-009 | [Commons](Commons/findings.md) | `Component-Commons.md` is stale relative to the actual file set |
| Commons-010 | [Commons](Commons/findings.md) | Behavior-bearing Commons types have no unit tests |
| Commons-011 | [Commons](Commons/findings.md) | `Result<T>.Failure` accepts a null error string |
| Commons-012 | [Commons](Commons/findings.md) | `ValueFormatter` uses current-culture formatting without documenting it |
| Communication-009 | [Communication](Communication/findings.md) | `_siteClients` field is mutable and reassignable; cache update is not atomic on failure |
| Communication-010 | [Communication](Communication/findings.md) | `DebugStreamBridgeActor` XML doc incorrectly describes it as a "Persistent actor" |
| Communication-011 | [Communication](Communication/findings.md) | No test coverage for snapshot-timeout cleanup, address-cache failure, or gRPC reconnect leak |
| ConfigurationDatabase-005 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Audit `Id` type disagrees with the design doc |
| ConfigurationDatabase-006 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `Site.GrpcNodeAAddress` / `GrpcNodeBAddress` columns are unbounded |
| ConfigurationDatabase-008 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `GetApprovedKeysForMethodAsync` CSV parsing silently drops malformed ids |
| ConfigurationDatabase-009 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Multi-collection eager loads issue cartesian-product queries |
| ConfigurationDatabase-010 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Several repositories and `InstanceLocator` lack direct test coverage |
| ConfigurationDatabase-011 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Inconsistent constructor null-guarding across repositories/services |
| DataConnectionLayer-008 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleUnsubscribe` is O(n^2) over instances and rechecks `_unresolvedTags` redundantly |
| DataConnectionLayer-013 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Misleading XML comment: `RaiseDisconnected` claims thread safety it does not provide |
| DeploymentManager-009 | [DeploymentManager](DeploymentManager/findings.md) | Misleading timeout comment on `DeleteInstanceAsync` |
| DeploymentManager-010 | [DeploymentManager](DeploymentManager/findings.md) | `SystemArtifactDeploymentRecord` does not persist the deployment ID |
| DeploymentManager-012 | [DeploymentManager](DeploymentManager/findings.md) | `LifecycleCommandTimeout` option is dead code |
| DeploymentManager-013 | [DeploymentManager](DeploymentManager/findings.md) | SMTP credentials serialized and broadcast to all sites |
| DeploymentManager-014 | [DeploymentManager](DeploymentManager/findings.md) | Dead `CreateCommand` helper in artifact tests |
| ExternalSystemGateway-011 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | Every call performs a full repository scan of all systems and methods |
| ExternalSystemGateway-012 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | Permanent-failure logging requirement is not met; `_logger` is injected but unused |
| ExternalSystemGateway-013 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `MaxConcurrentConnectionsPerSystem` and `DefaultHttpTimeout` options are defined but never used |
| ExternalSystemGateway-014 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | Cached-call buffering path and `DatabaseGateway` are untested |
| HealthMonitoring-004 | [HealthMonitoring](HealthMonitoring/findings.md) | Inconsistent heartbeat interval described across XML docs |
| HealthMonitoring-006 | [HealthMonitoring](HealthMonitoring/findings.md) | Sequence seeding contradicts the doc's "starting at 1" wording and is untestable |
| HealthMonitoring-010 | [HealthMonitoring](HealthMonitoring/findings.md) | `HealthReportSender` silently swallows inner failures with bare `catch {}` |
| HealthMonitoring-011 | [HealthMonitoring](HealthMonitoring/findings.md) | `AddHealthMonitoringActors` is a dead no-op placeholder |
| HealthMonitoring-012 | [HealthMonitoring](HealthMonitoring/findings.md) | `SiteHealthState.LatestReport` initialized to `null!`, misrepresenting the contract |
| Host-005 | [Host](Host/findings.md) | Blocking sync-over-async (`GetAwaiter().GetResult()`) inside `StartAsync` |
| Host-006 | [Host](Host/findings.md) | HOCON assembled by unescaped string interpolation |
| Host-007 | [Host](Host/findings.md) | REQ-HOST-4 rule "GrpcPort ≠ RemotingPort" is not enforced |
| Host-008 | [Host](Host/findings.md) | `MachineDataDb` is validated and declared but never consumed |
| Host-009 | [Host](Host/findings.md) | `StartAsync` reports success before role actors are confirmed running |
| Host-010 | [Host](Host/findings.md) | No retry/backoff around startup preconditions (DB migration, readiness) |
| Host-011 | [Host](Host/findings.md) | `LoggingOptions.MinimumLevel` is dead configuration |
| InboundAPI-009 | [InboundAPI](InboundAPI/findings.md) | Failed compilation is retried on every subsequent request |
| InboundAPI-010 | [InboundAPI](InboundAPI/findings.md) | `ParameterValidator` ignores extra body fields and cannot validate Object/List element types |
| InboundAPI-011 | [InboundAPI](InboundAPI/findings.md) | Method-existence check leaks to unapproved callers (enumeration oracle) |
| InboundAPI-012 | [InboundAPI](InboundAPI/findings.md) | `ParameterDefinition` POCO declared in the component project, not Commons |
| InboundAPI-013 | [InboundAPI](InboundAPI/findings.md) | `ApiKeyValidationResult.NotFound` factory returns HTTP 400, contradicting its name |
| ManagementService-005 | [ManagementService](ManagementService/findings.md) | ManagementActor declares no supervision strategy |
| ManagementService-008 | [ManagementService](ManagementService/findings.md) | HandleResolveRoles constructs RoleMapper manually instead of via DI |
| ManagementService-010 | [ManagementService](ManagementService/findings.md) | ManagementServiceOptions.CommandTimeout is defined but never used |
| ManagementService-011 | [ManagementService](ManagementService/findings.md) | ResolveRolesCommand dispatch path is stale dead code |
| ManagementService-012 | [ManagementService](ManagementService/findings.md) | ManagementEnvelope carries a loosely-typed object payload |
| NotificationService-010 | [NotificationService](NotificationService/findings.md) | `DeliverAsync` does not disconnect the SMTP client on failure |
| NotificationService-011 | [NotificationService](NotificationService/findings.md) | `SmtpPermanentException` declared in the wrong file; module conventions |
| NotificationService-012 | [NotificationService](NotificationService/findings.md) | Test coverage gaps: OAuth2 delivery path, permanent-classification fallback, token-cache concurrency |
| Security-008 | [Security](Security/findings.md) | N+1 query loading site-scope rules in `RoleMapper` |
| Security-009 | [Security](Security/findings.md) | CancellationToken not honored inside `Task.Run` LDAP calls |
| Security-010 | [Security](Security/findings.md) | Design doc contradicts itself on Windows Integrated Authentication |
| Security-011 | [Security](Security/findings.md) | Missing tests for security-critical paths |
| SiteEventLogging-006 | [SiteEventLogging](SiteEventLogging/findings.md) | Missing indexes for severity and keyword-search query paths |
| SiteEventLogging-009 | [SiteEventLogging](SiteEventLogging/findings.md) | XML doc on `LogEventAsync` claims asynchronous behaviour |
| SiteEventLogging-011 | [SiteEventLogging](SiteEventLogging/findings.md) | Stale "Phase 4+" placeholder in `ServiceCollectionExtensions` |
| SiteRuntime-012 | [SiteRuntime](SiteRuntime/findings.md) | `AttributeAccessor`/`ScopeAccessors` block the script on a synchronous Ask |
| SiteRuntime-013 | [SiteRuntime](SiteRuntime/findings.md) | `HandleUnsubscribeDebugView` does nothing despite documented behaviour |
| SiteRuntime-014 | [SiteRuntime](SiteRuntime/findings.md) | Trigger-expression evaluation blocks the coordinator actor thread |
| SiteRuntime-015 | [SiteRuntime](SiteRuntime/findings.md) | `LoggerFactory` created per Instance Actor and never disposed |
| SiteRuntime-016 | [SiteRuntime](SiteRuntime/findings.md) | Short-lived execution actors, replication actor, and repositories are untested |
| StoreAndForward-006 | [StoreAndForward](StoreAndForward/findings.md) | `GetParkedMessagesAsync` count and page run without a transaction |
| StoreAndForward-007 | [StoreAndForward](StoreAndForward/findings.md) | Async work in `ParkedMessageHandlerActor` uses `ContinueWith` without scheduler/affinity guarantees |
| StoreAndForward-008 | [StoreAndForward](StoreAndForward/findings.md) | A SQLite connection is opened and torn down on every storage call |
| StoreAndForward-009 | [StoreAndForward](StoreAndForward/findings.md) | `OnActivity` event invocation is not thread-safe against concurrent subscribe/unsubscribe |
| StoreAndForward-011 | [StoreAndForward](StoreAndForward/findings.md) | `StoreAndForwardMessageStatus.InFlight` is unused and the doc's "retrying" status is unmodelled |
| StoreAndForward-012 | [StoreAndForward](StoreAndForward/findings.md) | `StoreAndForwardMessage` is a persistence entity but lives in the component, not Commons |
| TemplateEngine-011 | [TemplateEngine](TemplateEngine/findings.md) | `SortedPropertiesConverterFactory` is dead code with a misleading comment |
| TemplateEngine-012 | [TemplateEngine](TemplateEngine/findings.md) | `DataType` enum naming diverges from the design doc |
| TemplateEngine-013 | [TemplateEngine](TemplateEngine/findings.md) | `ToDictionary(t => t.Id)` throws on duplicate IDs; cycle detectors overload Id 0 as a sentinel |
| TemplateEngine-014 | [TemplateEngine](TemplateEngine/findings.md) | Template-deletion constraint logic is duplicated and divergent |

@@ -0,0 +1,109 @@
# Code Review Process
This document describes how to perform a comprehensive, per-module code review of
the ScadaLink codebase and how to track findings to resolution.
A **module** is one buildable project under `src/` (e.g. `src/ScadaLink.TemplateEngine`).
Each module has its own folder under `code-reviews/` containing a single `findings.md`.
## 1. Before you start
1. Pick the module to review. Its folder is `code-reviews/<Module>/` where `<Module>`
is the project name with the `ScadaLink.` prefix stripped.
2. Identify the design context for the module:
- Its component design doc: `docs/requirements/Component-<Name>.md`.
- The relevant **Key Design Decisions** in `CLAUDE.md`.
- `docs/requirements/HighLevelReqs.md` for cross-cutting requirements.
3. Record the exact commit being reviewed: `git rev-parse --short HEAD`. Every review
is a snapshot — a finding only means something relative to a known commit.
4. Open `code-reviews/<Module>/findings.md` and fill in the header table
(reviewer, date, commit SHA).
## 2. Review checklist
Work through **every** category below for the module. A comprehensive review means
the checklist is completed even where it produces no findings — record "No issues
found" for a category rather than leaving it ambiguous.
1. **Correctness & logic bugs** — off-by-one, null handling, incorrect conditionals,
misuse of APIs, broken edge cases.
2. **Akka.NET conventions** — supervision strategies (Resume for coordinators, Stop
for short-lived actors), `Tell` for hot paths / `Ask` only at system boundaries,
message immutability, no blocking on non-blocking dispatchers, no `sender`/`this`
captured in closures (`PipeTo` instead), correlation IDs on request/response.
3. **Concurrency & thread safety** — shared mutable state, actor state mutated only
on the actor thread, race conditions, correct use of async/await.
4. **Error handling & resilience** — exception paths, store-and-forward integration,
reconnect/retry logic, failover behaviour, transient vs permanent error
classification, graceful degradation.
5. **Security** — authentication/authorization checks, input validation, the script
trust model (forbidden APIs: `System.IO`, `Process`, `Threading`, `Reflection`,
raw network), secret handling, SQL/LDAP injection, logging of sensitive data.
6. **Performance & resource management** — `IDisposable` disposal, stream/connection
lifetimes, buffering and back-pressure, unnecessary allocations, N+1 queries.
7. **Design-document adherence** — does the code match `Component-<Name>.md` and the
relevant CLAUDE.md decisions? Flag both code that drifts from the design and design
docs that are now stale.
8. **Code organization & conventions** — persistence-ignorant POCO entities in
Commons, repository interfaces in Commons / implementations in ConfigurationDatabase,
namespace hierarchy, Options pattern (options classes owned by component projects),
additive-only message contract evolution.
9. **Testing coverage** — are the module's behaviours covered by tests in `tests/`?
Note untested critical paths and missing edge-case tests.
10. **Documentation & comments** — XML doc accuracy, misleading or stale comments,
undocumented non-obvious behaviour.
## 3. Recording findings
Add one entry per finding to the `## Findings` section of the module's `findings.md`,
using the entry format in [`_template/findings.md`](_template/findings.md).
- **Finding ID** — `<Module>-NNN`, numbered sequentially within the module and never
reused (e.g. `TemplateEngine-001`). IDs are permanent even after resolution.
- **Severity:**
- **Critical** — data loss, security breach, crash/deadlock, or cluster-wide outage.
- **High** — incorrect behaviour with significant impact; no safe workaround.
- **Medium** — incorrect or risky behaviour with limited impact or a workaround.
- **Low** — minor issues, style, maintainability, documentation.
- **Category** — one of the 10 checklist categories above.
- **Location** — `file:line` (clickable), or a list of locations.
- **Description** — what is wrong and why it matters.
- **Recommendation** — concrete suggested fix.
After recording findings, update the module header table (status, open-finding count)
and refresh the base README (step 5).
## 4. Marking an item resolved
Findings are **never deleted** — they are an audit trail. To close one, change its
**Status** and complete the **Resolution** field:
- `Open` — newly recorded, not yet addressed.
- `In Progress` — a fix is actively being worked on.
- `Resolved` — fixed. The Resolution field must state the fixing commit SHA, the
date, and a one-line description of the fix.
- `Won't Fix` — intentionally not fixed. The Resolution field must justify why.
- `Deferred` — valid but postponed. The Resolution field must say what it is waiting
on (e.g. a tracked issue or a later milestone).
`Resolved`, `Won't Fix`, and `Deferred` findings are all considered **closed** and
drop off the base README's pending list. `Open` and `In Progress` are **pending**.
## 5. Updating the base README
`code-reviews/README.md` holds the single cross-module view. After any review or
status change, update it:
1. **Pending Findings table** — add/remove rows so it lists exactly the `Open` and
`In Progress` findings across all modules, sorted by severity.
2. **Module Status table** — update the row for the reviewed module (last-reviewed
date, commit, open-finding count, review status).
The base README must always agree with the per-module `findings.md` files — they are
the source of truth; the README is the aggregated index.
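The pending/closed rule from section 4 and the severity ordering of the Pending Findings table can be expressed as a small check. The following is a minimal sketch in Python for illustration only; it is not tooling that exists in the repository, and the `Finding` record and `pending_findings` function are hypothetical names.

```python
from dataclasses import dataclass

# Statuses that keep a finding on the README's pending list (section 4).
PENDING_STATUSES = {"Open", "In Progress"}
# Sort key for the Pending Findings table: most severe first.
SEVERITY_ORDER = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}


@dataclass(frozen=True)
class Finding:
    finding_id: str
    severity: str
    status: str


def pending_findings(findings):
    """Return Open / In Progress findings, most severe first, then by ID."""
    pending = [f for f in findings if f.status in PENDING_STATUSES]
    return sorted(pending, key=lambda f: (SEVERITY_ORDER[f.severity], f.finding_id))
```

Comparing the output of such a function against the README rows is one way to confirm the index and the per-module files agree.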
## 6. Re-reviewing a module
Re-reviews append to the same `findings.md`. Update the header to the new commit and
date, continue the finding numbering from the last used ID, and leave prior findings
(including closed ones) in place as history.
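Continuing the numbering from the last used ID can be sketched as below. This is an illustrative Python helper under the assumption that IDs follow the `<Module>-NNN` form from section 3; `next_finding_id` is a hypothetical name, not an existing script.

```python
import re

# <Module>-NNN, e.g. "TemplateEngine-001"; at least three digits.
ID_PATTERN = re.compile(r"^(?P<module>[A-Za-z][A-Za-z0-9]*)-(?P<number>\d{3,})$")


def next_finding_id(module: str, existing_ids: list[str]) -> str:
    """Return the next unused ID for a module, counting closed findings too."""
    highest = 0
    for finding_id in existing_ids:
        match = ID_PATTERN.match(finding_id)
        if match and match.group("module") == module:
            highest = max(highest, int(match.group("number")))
    return f"{module}-{highest + 1:03d}"
```

Because closed findings stay in `findings.md`, passing every ID in the file (not just pending ones) is what keeps IDs from being reused.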


@@ -0,0 +1,365 @@
# Code Review — Security
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.Security` |
| Design doc | `docs/requirements/Component-Security.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 11 |
## Summary
The Security module is small and reasonably structured: a stateless `LdapAuthService`
for search-then-bind authentication, a `JwtTokenService` for HMAC-signed cookie tokens,
a `RoleMapper` that resolves LDAP groups to roles, and ASP.NET Core authorization
policies plus a site-scope handler. Unit-test coverage of the happy paths is decent.
However, the review surfaced several real security weaknesses, the most serious being
that **StartTLS is dead code** (the design's "LDAPS or StartTLS" requirement is only
half met), that **the authentication cookie is not marked `Secure`** despite the design
mandating it, and that **the JWT signing key is never length-validated** so a weak or
empty key is silently accepted. There is also a genuine **DN-injection** gap in the
no-service-account fallback path, a filter/DN attribute mismatch (`uid=` vs `cn=`) that
makes that fallback path internally inconsistent, and an N+1 query in `RoleMapper`.
JWT validation also disables issuer/audience checks and the idle-timeout claim is reset
on every refresh, weakening the documented 30-minute idle policy. None of these are
crash/data-loss bugs, but the TLS, cookie, and key-validation items are security
defects that should be fixed before any production deployment.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `uid=`/`cn=` attribute mismatch between search filter and fallback DN construction (Security-004); StartTLS branch is unreachable (Security-001). |
| 2 | Akka.NET conventions | ☑ | No actors in this module — `AddSecurityActors` is an empty placeholder. Nothing to assess. |
| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; LDAP sync calls wrapped in `Task.Run`. No shared mutable state. No issues found. |
| 4 | Error handling & resilience | ☑ | LDAP failure paths return structured `LdapAuthResult`; group-lookup failure is tolerated per design. `ct` not honored inside `Task.Run` bodies (Security-009). |
| 5 | Security | ☑ | StartTLS dead code (Security-001), cookie not `Secure` (Security-002), JWT key unvalidated (Security-003), DN injection (Security-005), no issuer/audience validation (Security-006), idle-timeout reset on refresh (Security-007). |
| 6 | Performance & resource management | ☑ | N+1 scope-rule query in `RoleMapper` (Security-008). `LdapConnection` correctly disposed via `using`. |
| 7 | Design-document adherence | ☑ | StartTLS unsupported and Secure cookie missing both contradict the design doc; design also says "Windows Integrated Authentication" in Responsibilities, contradicting its own Authentication section (Security-010). |
| 8 | Code organization & conventions | ☑ | `SecurityOptions` correctly owned by the component; repository interface in Commons. No issues found. |
| 9 | Testing coverage | ☑ | No tests for `RoleMapper` N+1 behavior, DN-injection inputs, StartTLS path, or idle-timeout-after-refresh. Insecure-config combinations under-tested (Security-011). |
| 10 | Documentation & comments | ☑ | `SecurityOptions` XML docs say direct bind uses `cn={username}` while the search filter uses `uid=` — comment is misleading (covered under Security-004). |
## Findings
### Security-001 — StartTLS upgrade path is unreachable dead code
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Security/LdapAuthService.cs:37-47` |
**Description**
When `LdapUseTls` is true the code sets `connection.SecureSocketLayer = true` (LDAPS).
The subsequent StartTLS block is guarded by `if (_options.LdapUseTls && !connection.SecureSocketLayer)`.
Because `SecureSocketLayer` was just set to `true`, the second condition `!connection.SecureSocketLayer`
is always false, so `connection.StartTls()` is never called. The design doc explicitly
states LDAP connections must use **"LDAPS (port 636) or StartTLS"** — StartTLS is in
practice unsupported. A deployment that intends to use StartTLS on port 389 would instead
attempt an LDAPS-style implicit TLS handshake against the plaintext port, which fails; worse,
an operator may disable TLS entirely to make it work, sending credentials in cleartext.
**Recommendation**
Introduce an explicit transport mode (e.g. `LdapTransport { Ldaps, StartTls, None }`)
or a separate `LdapUseStartTls` flag. For StartTLS, leave `SecureSocketLayer` false,
call `connection.Connect`, then call `connection.StartTls()` and verify the negotiated
session is encrypted before binding. Remove the unreachable conditional.
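The dead branch and the recommended explicit transport mode can be illustrated with a language-neutral sketch. The Python below is not the module's C# code: `starttls_reachable_as_written` mirrors the guard described above, while `LdapTransport` and `connection_steps` are hypothetical names showing one possible ordering of steps.

```python
from enum import Enum


class LdapTransport(Enum):
    LDAPS = "ldaps"
    START_TLS = "starttls"
    NONE = "none"


def starttls_reachable_as_written(use_tls: bool) -> bool:
    """Mirror of the reviewed guard: the flag that selects LDAPS also blocks StartTLS."""
    secure_socket_layer = use_tls  # set to true whenever LdapUseTls is true
    return use_tls and not secure_socket_layer


def connection_steps(transport: LdapTransport) -> list[str]:
    """Order of operations per transport mode (assumed, for illustration)."""
    if transport is LdapTransport.LDAPS:
        return ["enable implicit TLS", "connect", "bind"]
    if transport is LdapTransport.START_TLS:
        return ["connect", "start TLS", "verify session is encrypted", "bind"]
    return ["connect", "bind"]
```

With a single boolean, no input makes the StartTLS guard true; a three-valued mode removes the ambiguity.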
**Resolution**
_Unresolved._
### Security-002 — Authentication cookie is not marked `Secure`
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Security/ServiceCollectionExtensions.cs:16-23` |
**Description**
`AddCookie` sets `HttpOnly = true` and `SameSite = Strict` but never sets
`options.Cookie.SecurePolicy`. The ASP.NET Core default is `CookieSecurePolicy.SameAsRequest`,
which permits the cookie (carrying the embedded JWT — a bearer credential) to be sent
over plain HTTP. The design doc states the cookie is **"HttpOnly and Secure (requires
HTTPS)"**. As written, the module does not enforce that requirement; a misconfigured or
HTTP-fronted deployment would transmit the session token in cleartext.
**Recommendation**
Set `options.Cookie.SecurePolicy = CookieSecurePolicy.Always` in `AddCookie`. Consider
also setting `ExpireTimeSpan` and `SlidingExpiration` to align the cookie lifetime with
the documented 15-minute JWT / 30-minute idle policy.
**Resolution**
_Unresolved._
### Security-003 — JWT signing key length is never validated
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Security/JwtTokenService.cs:33`, `src/ScadaLink.Security/SecurityOptions.cs:42` |
**Description**
`SecurityOptions.JwtSigningKey` defaults to `string.Empty` and is fed directly into
`new SymmetricSecurityKey(Encoding.UTF8.GetBytes(_options.JwtSigningKey))` with no
validation. RFC 7518 requires HS256 (HMAC-SHA256) keys of at least 256 bits (32 bytes); a short or empty
key produces a trivially forgeable token. The `SecurityHardeningTests` comment claims a
minimum length is "enforced", but no code in this module enforces it — the test only
asserts that a 32+ char key works. A deployment with a missing or short `JwtSigningKey`
would start successfully and issue weakly-signed tokens.
**Recommendation**
Validate `JwtSigningKey` at startup — fail fast if it is empty or shorter than 32 bytes.
Use an `IValidateOptions<SecurityOptions>` validator or guard in the `JwtTokenService`
constructor so a weak key is rejected before any token is issued.
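A fail-fast key check is small. The sketch below is illustrative Python rather than the recommended `IValidateOptions<SecurityOptions>` implementation; `validate_signing_key` and `MIN_KEY_BYTES` are hypothetical names, and the length is measured on the UTF-8 bytes because that is what the reviewed code feeds into the key.

```python
MIN_KEY_BYTES = 32  # 256 bits, the minimum for an HS256 key


def validate_signing_key(key: str) -> bytes:
    """Return the key bytes, or raise if the key is empty or too short."""
    key_bytes = key.encode("utf-8")
    if len(key_bytes) < MIN_KEY_BYTES:
        raise ValueError(
            f"JWT signing key must be at least {MIN_KEY_BYTES} bytes; got {len(key_bytes)}"
        )
    return key_bytes
```

Running a check like this during startup means a missing key stops the process instead of producing forgeable tokens.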
**Resolution**
_Unresolved._
### Security-004 — Search filter uses `uid=` while fallback DN construction uses `cn=`
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Security/LdapAuthService.cs:66`, `:138`, `:157-159` |
**Description**
`AuthenticateAsync` and `ResolveUserDnAsync` build the search filter as
`(uid={username})`, but the no-service-account fallback in `ResolveUserDnAsync`
constructs the bind DN as `cn={username},{LdapSearchBase}`. The `SecurityOptions.LdapServiceAccountDn`
XML comment also documents the fallback as `cn={username},{LdapSearchBase}`. A directory
keyed on `uid` will succeed via search-then-bind but fail via the direct-bind fallback
(and vice versa). The attribute used for lookup is hard-coded and inconsistent across
the two code paths, so the two configuration modes are not interchangeable.
**Recommendation**
Introduce a single configurable `LdapUserIdAttribute` (default `uid`) and use it
consistently in both the search filter and the fallback DN. Update the XML doc to match.
**Resolution**
_Unresolved._
### Security-005 — DN injection in the no-service-account bind fallback
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Security/LdapAuthService.cs:157-159` |
**Description**
When no service account is configured, the user-supplied `username` is interpolated
directly into a distinguished name: `$"cn={username},{LdapSearchBase}"`. `EscapeLdapFilter`
escapes *search-filter* metacharacters, but DN construction requires a different
escaping scheme (RFC 4514 — `,`, `+`, `"`, `\`, `<`, `>`, `;`, a leading `#`, leading/trailing spaces).
No DN escaping is applied here. A username such as `victim,ou=admins` alters the DN
structure, allowing a caller to attempt a bind as a different DN than intended. Combined
with the `username.Contains('=')` shortcut at line 129 — which lets a caller supply a
full arbitrary DN — the fallback path gives the client undue control over the bind
identity.
**Recommendation**
Apply RFC 4514 DN-component escaping to `username` before interpolation, or use the
LDAP library's DN-builder API. Reconsider the `Contains('=')` shortcut — accepting a
raw DN from untrusted input is risky; restrict it or remove it.
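To make the difference between filter escaping and DN escaping concrete, here is a minimal sketch of RFC 4514 attribute-value escaping. It is illustrative Python, not the module's C#; `escape_dn_value` and `build_user_dn` are hypothetical helpers, and a production fix may prefer the LDAP library's own DN facilities where available.

```python
# Characters RFC 4514 requires escaping anywhere in an attribute value.
_ALWAYS_ESCAPE = set('"+,;<>\\')


def escape_dn_value(value: str) -> str:
    """Escape a single RDN attribute value per RFC 4514."""
    out = []
    last = len(value) - 1
    for i, ch in enumerate(value):
        if ch == "\x00":
            out.append("\\00")  # NUL is written as a hex pair
        elif ch in _ALWAYS_ESCAPE:
            out.append("\\" + ch)
        elif i == 0 and ch in " #":
            out.append("\\" + ch)  # leading space or '#'
        elif i == last and ch == " ":
            out.append("\\" + ch)  # trailing space
        else:
            out.append(ch)
    return "".join(out)


def build_user_dn(username: str, user_attribute: str, search_base: str) -> str:
    """Build the fallback bind DN with the username confined to one RDN value."""
    return f"{user_attribute}={escape_dn_value(username)},{search_base}"
```

With this escaping, `victim,ou=admins` stays a single attribute value instead of adding an extra RDN to the bind DN.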
**Resolution**
_Unresolved._
### Security-006 — JWT validation disables issuer and audience checks
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Security/JwtTokenService.cs:67-75`, `:56-59` |
**Description**
`ValidateToken` sets `ValidateIssuer = false` and `ValidateAudience = false`, and
`GenerateToken` never sets an `iss` or `aud`. With a shared symmetric HMAC key, any
other system or component that signs JWTs with the same key would produce tokens this
service accepts. While the design states the key is shared only between the two central
nodes, omitting issuer/audience binding removes a cheap defense-in-depth control and
makes accidental key reuse (e.g. the same secret used for another internal token)
silently exploitable.
**Recommendation**
Set a fixed `Issuer` and `Audience` (e.g. `"scadalink-central"`) when generating tokens
and enable `ValidateIssuer`/`ValidateAudience` with the matching expected values during
validation.
**Resolution**
_Unresolved._
### Security-007 — Idle-timeout claim is reset on every token refresh
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Security/JwtTokenService.cs:40`, `:111-123` |
**Description**
The design states the 30-minute idle timeout is tracked via a "last-activity timestamp
in the token", and `IsIdleTimedOut` reads the `LastActivity` claim. But `RefreshToken`
calls `GenerateToken`, which unconditionally writes `LastActivity = DateTimeOffset.UtcNow`.
Token refresh fires whenever a request arrives within ~5 minutes of expiry. As a result,
`LastActivity` records *token issuance time* rather than genuine user activity, and because
every refresh is itself triggered by a request, the timestamp keeps moving forward. The idle
window is therefore measured from the last refresh, not the last real interaction, so the
documented "no requests within the idle window" semantics are not faithfully implemented.
The claim name `LastActivity` is also misleading.
**Recommendation**
Decide explicitly how activity is tracked. Either (a) carry the original `LastActivity`
forward across refreshes and update it only on real request handling in the middleware,
or (b) rename the claim to `IssuedAt`/`TokenCreated` and document that the idle window
is measured from issuance. Whichever is chosen, ensure `IsIdleTimedOut` and the refresh
path agree on the semantics.
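Option (a) can be sketched as follows: refresh carries the activity timestamp forward, and only an explicit activity hook moves it. This is illustrative Python with hypothetical names (`SessionToken`, `record_activity`, `refresh`); which requests count as "real" activity remains a policy decision for the middleware.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta

IDLE_TIMEOUT = timedelta(minutes=30)


@dataclass(frozen=True)
class SessionToken:
    issued_at: datetime
    last_activity: datetime


def record_activity(token: SessionToken, now: datetime) -> SessionToken:
    """Called by request handling when genuine user activity occurs."""
    return replace(token, last_activity=now)


def refresh(token: SessionToken, now: datetime) -> SessionToken:
    """Re-issue the token; last_activity is carried forward unchanged."""
    return replace(token, issued_at=now)


def is_idle_timed_out(token: SessionToken, now: datetime) -> bool:
    return now - token.last_activity > IDLE_TIMEOUT
```

Keeping issuance and activity as separate fields lets the refresh path and the idle check agree on what each timestamp means.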
**Resolution**
_Unresolved._
### Security-008 — N+1 query loading site-scope rules in `RoleMapper`
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.Security/RoleMapper.cs:25-48` |
**Description**
`MapGroupsToRolesAsync` first calls `GetAllMappingsAsync`, then inside the per-mapping
loop calls `GetScopeRulesForMappingAsync(mapping.Id, ct)` once for every matched
Deployment mapping. This is an N+1 query pattern executed on the login hot path and on
every 15-minute token refresh. With multiple site-scoped Deployment groups it issues a
round-trip per group.
**Recommendation**
Add a repository method that loads scope rules for a set of mapping IDs in one query
(or eager-loads them with the mappings), and resolve all scope rules with a single call.
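The batched load can be illustrated with a single `IN (...)` query. The sketch below uses Python's built-in `sqlite3` purely for demonstration; the `scope_rules` table, its columns, and `load_scope_rules` are assumptions, not the repository's actual schema or API.

```python
import sqlite3


def load_scope_rules(conn: sqlite3.Connection, mapping_ids: list[int]) -> dict[int, list[int]]:
    """Load scope rules for many mappings in one query, grouped by mapping ID."""
    rules: dict[int, list[int]] = {mapping_id: [] for mapping_id in mapping_ids}
    if not mapping_ids:
        return rules
    # One placeholder per ID keeps the query parameterised.
    placeholders = ", ".join("?" for _ in mapping_ids)
    rows = conn.execute(
        "SELECT mapping_id, site_id FROM scope_rules "
        f"WHERE mapping_id IN ({placeholders}) ORDER BY mapping_id, site_id",
        list(mapping_ids),
    )
    for mapping_id, site_id in rows:
        rules[mapping_id].append(site_id)
    return rules
```

The caller issues one round-trip regardless of how many Deployment mappings matched, and mappings without rules still appear with an empty list.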
**Resolution**
_Unresolved._
### Security-009 — CancellationToken not honored inside `Task.Run` LDAP calls
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Security/LdapAuthService.cs:42`, `:46`, `:51`, `:56-57`, `:67-73`, `:135`, `:139-145` |
**Description**
The synchronous Novell LDAP calls are wrapped in `Task.Run(() => ..., ct)`. The `ct`
argument only prevents the work item from *starting* if cancellation is already
signaled; once a `connection.Connect`/`Bind`/`Search` call is in progress it cannot be
cancelled. A cancelled or timed-out login request will continue to occupy a thread-pool
thread and an LDAP connection until the blocking call returns on its own. There is also
no explicit network/operation timeout configured on the `LdapConnection`.
**Recommendation**
Configure `LdapConnection.ConnectionTimeout` and search/operation time limits so a
hung LDAP server cannot pin a thread indefinitely. Document that `ct` only guards
work-item scheduling, or implement a timeout-with-disconnect fallback.
**Resolution**
_Unresolved._
### Security-010 — Design doc contradicts itself on Windows Integrated Authentication
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `docs/requirements/Component-Security.md:13` (vs. `:23`) |
**Description**
The Responsibilities section states the component authenticates "using Windows
Integrated Authentication", but the Authentication section (line 23) and CLAUDE.md
explicitly state **"No Windows Integrated Authentication ... authenticates directly
against LDAP/AD, not via Kerberos/NTLM"** — which is what the code actually does
(direct LDAP bind). The Responsibilities line is stale and contradicts both the rest of
the doc and the implementation.
**Recommendation**
Fix `Component-Security.md:13` to say "using a direct LDAP/Active Directory bind"
to match the implemented behavior and the rest of the document.
**Resolution**
_Unresolved._
### Security-011 — Missing tests for security-critical paths
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.Security.Tests/UnitTest1.cs` |
**Description**
The test suite covers happy paths well but omits several security-relevant cases:
no test exercises the StartTLS path (Security-001), the DN-injection / `Contains('=')`
fallback inputs (Security-005), JWT validation with a too-short or empty signing key
(Security-003), `IsIdleTimedOut` returning true after a token has been refreshed
(Security-007), or the `uid`/`cn` mismatch in the no-service-account path (Security-004).
The integration `SecurityHardeningTests` only asserts default option values, not
enforcement. The test file is still named `UnitTest1.cs`.
**Recommendation**
Add negative/edge-case tests for the items above, particularly key-length rejection,
DN-escaping of hostile usernames, and idle-timeout behavior across a refresh. Rename
`UnitTest1.cs` to a descriptive name.
**Resolution**
_Unresolved._


@@ -0,0 +1,402 @@
# Code Review — SiteEventLogging
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.SiteEventLogging` |
| Design doc | `docs/requirements/Component-SiteEventLogging.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 11 |
## Summary
The SiteEventLogging module is small and broadly well-structured: a SQLite-backed
recorder (`SiteEventLogger`), a query service with keyset pagination, a background
purge service, and a thin Akka actor bridge. The query path is parameterised
correctly (no SQL injection) and reasonably well tested. However, the storage-cap
enforcement is functionally broken: `PRAGMA incremental_vacuum` is a no-op because
`auto_vacuum = INCREMENTAL` is never set, so the cap-purge loop never sees the
database shrink and over-deletes the entire table when triggered. There is also a
genuine concurrency hazard: the purge service and query service share the single
`SqliteConnection` owned by `SiteEventLogger` but bypass its `_writeLock`, so a purge
running on the background thread can collide with a write or a query on another
thread. The `LogEventAsync` API is synchronous despite its name and `Task` return,
which silently blocks Akka actor threads on disk I/O. Other findings concern the
cluster-singleton placement of the handler actor (which can pin to the standby
node), missing indexes for common query filters, retention/cap purge not enforcing
the requirement strictly, and several documentation/maintainability issues.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `incremental_vacuum` no-op breaks cap purge (-001); over-delete on cap (-002). |
| 2 | Akka.NET conventions | ☑ | Handler actor has no supervision/correlation concerns of its own; singleton placement issue (-004). `Ask` boundary is appropriate. |
| 3 | Concurrency & thread safety | ☑ | Shared `SqliteConnection` used by purge/query without the write lock (-003). |
| 4 | Error handling & resilience | ☑ | `LogEventAsync` swallows write failures, leaving only a log line (-008); purge catches broadly. |
| 5 | Security | ☑ | Queries fully parameterised. No authz in module (delegated to caller) — noted, not a finding. |
| 6 | Performance & resource management | ☑ | Synchronous I/O on actor threads (-005); missing indexes for severity/source/message (-006). |
| 7 | Design-document adherence | ☑ | Singleton placement contradicts "active node" model (-004); cap purge does not honour "oldest first within budget" cleanly (-002). |
| 8 | Code organization & conventions | ☑ | Concrete-type downcast of `ISiteEventLogger` (-007); `internal Connection` leaks DB handle (-007). |
| 9 | Testing coverage | ☑ | No tests for purge interaction with live writes, vacuum effectiveness, the actor bridge, or query error path (-010). |
| 10 | Documentation & comments | ☑ | `LogEventAsync` XML doc says "asynchronously" but is synchronous (-009); stale "Phase 4+" placeholder (-011). |
## Findings
### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:100-102`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:36-55` |
**Description**
`PurgeByStorageCap` issues `PRAGMA incremental_vacuum` after each delete batch to
reclaim space, then re-measures the database size via `page_count * page_size`.
`incremental_vacuum` only has any effect when the database was created with
`auto_vacuum = INCREMENTAL`. `InitializeSchema` never sets `auto_vacuum`, so the
database uses the SQLite default (`auto_vacuum = NONE`). With `NONE`,
`incremental_vacuum` is silently ignored and `page_count` does not decrease when
rows are deleted (free pages are retained in the file). Consequently the
`while (currentSizeBytes > capBytes)` loop never observes the size dropping. The
storage-cap feature required by the design ("configurable maximum database size...
oldest events are purged first") is therefore non-functional — it cannot bring the
file back under the cap.
**Recommendation**
Set `PRAGMA auto_vacuum = INCREMENTAL` in `InitializeSchema` before any tables are
created (it must be set before table creation or followed by a full `VACUUM` to take
effect on an existing database). Alternatively, run a full `VACUUM` after cap-purge
deletes, or measure logical data size (e.g. `(page_count - freelist_count) * page_size`)
instead of relying on `incremental_vacuum`.
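The SQLite behaviour described here can be reproduced in isolation. The sketch below uses Python's built-in `sqlite3` rather than the module's C# code; the `events` table and `page_counts_after_delete` are hypothetical, and only the pragmas are taken from the finding.

```python
import os
import sqlite3
import tempfile


def page_counts_after_delete(auto_vacuum: str) -> tuple[int, int]:
    """Return page_count before the delete and after delete + incremental_vacuum."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "events.db"))
        try:
            # auto_vacuum must be chosen before the first table is created.
            conn.execute(f"PRAGMA auto_vacuum = {auto_vacuum}")
            conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, message TEXT)")
            conn.executemany(
                "INSERT INTO events (message) VALUES (?)", [("x" * 500,)] * 2000
            )
            conn.commit()
            before = conn.execute("PRAGMA page_count").fetchone()[0]
            conn.execute("DELETE FROM events")
            conn.commit()
            # The pragma yields a row per freed page, so drain the cursor.
            conn.execute("PRAGMA incremental_vacuum").fetchall()
            after = conn.execute("PRAGMA page_count").fetchone()[0]
            return before, after
        finally:
            conn.close()
```

Calling it with `"NONE"` and with `"INCREMENTAL"` shows the difference: only the second lets `page_count` fall after rows are deleted.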
**Resolution**
_Unresolved._
### SiteEventLogging-002 — Storage-cap purge deletes the entire table when space is not reclaimed
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:87-105` |
**Description**
Because of SiteEventLogging-001 the on-disk size never shrinks after a delete batch,
so `currentSizeBytes` stays above `capBytes`. The loop then keeps deleting 1000-row
batches on every iteration until `ExecuteNonQuery` returns 0 — i.e. until the table
is completely empty. The design states the cap should purge "the oldest events...
first" to stay within budget, not wipe the whole log. When the cap is hit (e.g.
during an alarm storm) this destroys all retained diagnostic history rather than
trimming it to the budget. The unit test `PurgeByStorageCap_DeletesOldestWhenOverCap`
masks the problem because it uses `MaxStorageMb = 0`, which legitimately expects an
empty table, so the over-delete behaviour is never exercised against a realistic cap.
**Recommendation**
Fix the size measurement / vacuum (SiteEventLogging-001) so the loop terminates when
the file is genuinely under the cap. Add a guard so the loop stops once
`currentSizeBytes` has stopped decreasing across iterations, and add a test with a
non-zero cap and a known oversized dataset to assert that only the oldest events are
removed.
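The recommended stall guard can be sketched independently of SQLite. This is illustrative Python with hypothetical names; `measure_bytes` and `delete_oldest_batch` stand in for the size query and the batch `DELETE`.

```python
from typing import Callable


def purge_to_cap(
    measure_bytes: Callable[[], int],
    delete_oldest_batch: Callable[[], int],
    cap_bytes: int,
) -> int:
    """Delete oldest batches until under the cap; stop if size stops shrinking."""
    deleted_total = 0
    size = measure_bytes()
    while size > cap_bytes:
        deleted = delete_oldest_batch()
        if deleted == 0:
            break  # table is empty, nothing more to trim
        deleted_total += deleted
        new_size = measure_bytes()
        if new_size >= size:
            break  # measurement is not responding to deletes; do not wipe the table
        size = new_size
    return deleted_total
```

Note that a stalled measurement still costs one batch before it is detected, so the guard complements the measurement fix rather than replacing it.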
**Resolution**
_Unresolved._
### SiteEventLogging-003 — Shared `SqliteConnection` used by purge and query without the write lock
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:64,90,100,110,114`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:36`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34,72` |
**Description**
`SiteEventLogger` owns a single `SqliteConnection` and serialises its own writes via
`lock (_writeLock)`. `EventLogPurgeService` and `EventLogQueryService` both reach
into `_eventLogger.Connection` and execute commands directly, without acquiring
`_writeLock`. The purge runs on a `BackgroundService` thread (a different thread from
event-recording callers and from the actor that drives the query service). A single
`SqliteConnection` / `SqliteCommand` is not thread-safe; concurrent use from the
purge thread and a recording thread (or query thread) can throw
`SqliteException`/`InvalidOperationException` ("DataReader already open",
"connection busy") or corrupt command state. The purge `DELETE` and the recorder
`INSERT` racing is the most likely collision because event recording is continuous.
**Recommendation**
Funnel all access to the connection through a single synchronisation point: either
expose lock-guarded methods on `SiteEventLogger` for purge/query to call, or give the
purge and query services their own dedicated `SqliteConnection` instances (SQLite
supports multiple connections to the same file; WAL journal mode plus a `busy_timeout`
makes this safer). Do not share one `SqliteConnection` across threads.
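The first option, a single synchronisation point in front of one connection, looks roughly like this. The sketch is Python with the built-in `sqlite3` module, not the module's C#; `EventStore` and its methods are hypothetical names.

```python
import sqlite3
import threading


class EventStore:
    """One connection, one lock: every caller goes through guarded methods."""

    def __init__(self, path: str) -> None:
        # check_same_thread=False is safe here only because of the lock below.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()
        with self._lock:
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, message TEXT)"
            )
            self._conn.commit()

    def record(self, message: str) -> None:
        with self._lock:
            self._conn.execute("INSERT INTO events (message) VALUES (?)", (message,))
            self._conn.commit()

    def purge_oldest(self, count: int) -> int:
        with self._lock:
            cursor = self._conn.execute(
                "DELETE FROM events WHERE id IN "
                "(SELECT id FROM events ORDER BY id LIMIT ?)",
                (count,),
            )
            self._conn.commit()
            return cursor.rowcount

    def count(self) -> int:
        with self._lock:
            return self._conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The key property is that no caller ever receives the raw connection, so purge, query, and recording cannot interleave on it.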
**Resolution**
_Unresolved._
### SiteEventLogging-004 — Event-log handler runs as a cluster singleton that can land on the standby node
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:313-336`, `src/ScadaLink.SiteEventLogging/EventLogHandlerActor.cs:21-25` |
**Description**
`EventLogHandlerActor` is hosted as a `ClusterSingletonManager` singleton with the
stated intent that "queries always reach the active node". However, an Akka.NET
cluster singleton is pinned to the *oldest* member of the role, which is not the
same concept as the SCADA "active node" (the node currently running the Deployment
Manager singleton / serving live traffic). The design doc is explicit: "Only the
active node generates and stores events... the new active node starts logging to its
own SQLite database." The event-log SQLite file is node-local and unreplicated.
Nothing guarantees the event-log singleton co-locates with the active node, so a
remote query can be served by the standby node and read that node's near-empty
database, returning no events even though the active node has a full log. The
explanatory comment in `AkkaHostedService.cs` asserts the opposite of what actually
happens.
**Recommendation**
Either (a) host the query handler as a normal per-node actor and route queries to
the active node explicitly (the node owning the Deployment Manager singleton), or
(b) make the event-log writer follow the same singleton so the writer and the query
handler are guaranteed co-located. Reconcile the design doc and the inline comment
with whichever model is chosen.
**Resolution**
_Unresolved._
### SiteEventLogging-005 — `LogEventAsync` performs synchronous disk I/O on the caller's thread
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57-99` |
**Description**
`LogEventAsync` has an asynchronous-looking signature (returns `Task`, `Async` suffix) but its
body is entirely synchronous: it takes `lock (_writeLock)`, runs
`cmd.ExecuteNonQuery()` (a blocking SQLite write), then returns `Task.CompletedTask`.
Callers across the codebase invoke it fire-and-forget as `_ = LogEventAsync(...)`
(e.g. `ScriptExecutionActor.cs:133`, `DataConnectionActor.cs:292`,
`ScriptActor.cs:250`) expecting it to be non-blocking. In reality the SQLite write,
and any contention on `_writeLock`, executes inline on the Akka actor thread of the
calling subsystem. Under an event burst (alarm storm, script failure loop) this
serialises actor threads on disk I/O and the global write lock, degrading the
hot-path subsystems the design intends to keep responsive.
**Recommendation**
Either make recording genuinely asynchronous (offload to a dedicated single-threaded
writer / `Channel<T>` consumer so callers truly fire-and-forget), or rename the
method to `LogEvent` and document that it blocks, so callers can decide. Given the
design's emphasis on not impacting runtime subsystems, an internal queue with a
background flush is preferable.
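The queue-plus-background-writer shape can be sketched in a few lines. This is illustrative Python standing in for a `Channel<T>` consumer; `QueuedEventWriter` is a hypothetical name, and the `write` callable represents the blocking SQLite insert.

```python
import queue
import threading
from typing import Callable


class QueuedEventWriter:
    """Callers enqueue and return; one background thread performs the writes."""

    _STOP = object()

    def __init__(self, write: Callable[[str], None]) -> None:
        self._write = write
        self._queue: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def log_event(self, message: str) -> None:
        # Non-blocking for the caller: no disk I/O happens on this thread.
        self._queue.put(message)

    def close(self) -> None:
        # Flush everything queued so far, then stop the writer thread.
        self._queue.put(self._STOP)
        self._thread.join()

    def _run(self) -> None:
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            self._write(item)
```

An unbounded queue is used here for brevity; a real implementation would need to decide on a bound and a drop or back-pressure policy for event bursts.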
**Resolution**
_Unresolved._
### SiteEventLogging-006 — Missing indexes for severity and keyword-search query paths
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:50-52`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:65-81` |
**Description**
`InitializeSchema` creates indexes on `timestamp`, `event_type`, and `instance_id`.
The query service also filters on `severity` (`severity = $severity`) and performs
`message LIKE '%...%'` / `source LIKE '%...%'` keyword search. `severity` has no
index, and a leading-wildcard `LIKE` cannot use a normal index at all. With up to a
1 GB database and a 500-row page size, severity-filtered and keyword queries do full
table scans on every page. The design explicitly lists keyword search as a supported,
expected query type.
**Recommendation**
Add an index on `severity` (or a composite index aligned with common filter
combinations such as `(event_type, severity, id)`). For keyword search, consider an
FTS5 virtual table over `message` and `source`, or accept the scan but document the
cost.
**Resolution**
_Unresolved._
### SiteEventLogging-007 — `ISiteEventLogger` consumers downcast to the concrete type and reach into the DB connection
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/EventLogPurgeService.cs:25`, `src/ScadaLink.SiteEventLogging/EventLogQueryService.cs:26`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:34` |
**Description**
Both `EventLogPurgeService` and `EventLogQueryService` take `ISiteEventLogger` via
DI and immediately downcast it: `_eventLogger = (SiteEventLogger)eventLogger;`. They
then access the `internal SqliteConnection Connection` property to run arbitrary SQL.
This defeats the purpose of the interface abstraction, makes the registration
fragile (any `ISiteEventLogger` that is not exactly `SiteEventLogger` causes an
`InvalidCastException` at construction), and leaks the database handle and raw SQL
surface out of the recorder. It is also the root cause of the unsynchronised
connection sharing in SiteEventLogging-003.
**Recommendation**
Introduce a proper data-access abstraction (e.g. an `IEventLogStore` with
`Insert`, `Query`, `PurgeOlderThan`, `PurgeToSize`, `GetSizeBytes`) that owns the
connection and its locking, and inject that into the recorder, query, and purge
services. Remove the `internal Connection` property and the concrete-type downcasts.
**Resolution**
_Unresolved._
### SiteEventLogging-008 — Event-recording write failures are silently swallowed
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:92-95` |
**Description**
If `ExecuteNonQuery` throws (disk full, database locked, file corruption), the
exception is caught, written to `ILogger`, and discarded; `LogEventAsync` still
returns `Task.CompletedTask` as if successful. Callers fire-and-forget the result so
they cannot detect failure. The event log is the site's diagnostic audit trail; a
sustained write failure (for example a locked-database storm caused by the
unsynchronised purge in SiteEventLogging-003) means events vanish with no signal to
operators except a local log line that nobody is watching. There is no failure
counter, no health-metric hook, and no retry.
**Recommendation**
Expose a failure signal: increment a counter that the Health Monitoring component
can surface (the design notes script/alarm error rates are derived from the event
log — a logging outage should be visible). At minimum, escalate repeated failures to
a Warning/Error health metric rather than only a local log line.
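
A minimal sketch of such a failure signal, assuming the recorder calls it from its existing `catch` block and success path. `EventLogWriteHealth` and the threshold of 3 are hypothetical; how the Health Monitoring component would read the values is not shown.

```csharp
using System;
using System.Threading;

// Hypothetical thread-safe counter for event-log write outcomes.
public sealed class EventLogWriteHealth
{
    private long _totalFailures;
    private int _consecutiveFailures;

    public long TotalFailures => Interlocked.Read(ref _totalFailures);
    public int ConsecutiveFailures => Volatile.Read(ref _consecutiveFailures);

    // Call after a successful insert: resets the streak but keeps the total.
    public void RecordSuccess() => Interlocked.Exchange(ref _consecutiveFailures, 0);

    // Call from the catch block that currently only logs.
    public void RecordFailure()
    {
        Interlocked.Increment(ref _totalFailures);
        Interlocked.Increment(ref _consecutiveFailures);
    }

    // The threshold is an assumed value, not taken from the design doc.
    public bool IsDegraded(int threshold = 3) => ConsecutiveFailures >= threshold;
}
```

Tracking the consecutive streak separately from the total lets a single transient failure stay quiet while a sustained outage escalates.
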
**Resolution**
_Unresolved._
### SiteEventLogging-009 — XML doc on `LogEventAsync` claims asynchronous behaviour
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/ISiteEventLogger.cs:8-10`, `src/ScadaLink.SiteEventLogging/SiteEventLogger.cs:57` |
**Description**
The interface XML doc states "Record an event asynchronously." and the method is
named `LogEventAsync`, but the implementation is fully synchronous (see
SiteEventLogging-005). The documentation and naming are misleading: a reader will
reasonably assume the write is offloaded and the caller's thread is not blocked,
which is false. The `details` parameter doc says "Optional JSON details" but nothing
validates or requires JSON, so callers may pass arbitrary text.
**Recommendation**
Align the name, signature, and documentation with the actual behaviour — either make
the method genuinely asynchronous or rename to `LogEvent` and correct the doc.
Clarify that `details` is free-form text unless JSON is actually enforced.
**Resolution**
_Unresolved._
### SiteEventLogging-010 — Test coverage gaps: actor bridge, purge/write concurrency, vacuum effectiveness, query error path
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.SiteEventLogging.Tests/` |
**Description**
The test suite covers recording, query filtering/pagination, and basic purge, but
several critical behaviours are untested:
- `EventLogHandlerActor` has no test — the actor message contract
(`EventLogQueryRequest` -> `EventLogQueryResponse`, `Sender.Tell`) is unverified.
- No test exercises purge running concurrently with active writes/queries, so the
connection-sharing race (SiteEventLogging-003) is invisible to CI.
- `PurgeByStorageCap` is only tested with `MaxStorageMb = 0`, which hides the
no-op-vacuum / over-delete bug (SiteEventLogging-001, -002). No test asserts the
file shrinks or that only oldest events are removed under a realistic cap.
- `EventLogQueryService.ExecuteQuery`'s catch block (`Success: false`,
`ErrorMessage`) has no test.
- `SiteEventLogger.Dispose` semantics (logging after dispose returns
`Task.CompletedTask`) and re-entrant dispose are untested.
**Recommendation**
Add tests for the actor bridge, a concurrency stress test (purge + write + query in
parallel), a realistic non-zero-cap purge test asserting size reduction and
oldest-first deletion, and a query-error-path test (e.g. corrupt/closed connection).
**Resolution**
_Unresolved._
### SiteEventLogging-011 — Stale "Phase 4+" placeholder in `ServiceCollectionExtensions`
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.SiteEventLogging/ServiceCollectionExtensions.cs:18-22` |
**Description**
`AddSiteEventLoggingActors` is an empty method with a comment "Placeholder for Akka
actor registration (Phase 4+)". The actor (`EventLogHandlerActor`) is in fact already
implemented and is registered directly in `AkkaHostedService.cs:313-336`, not through
this method. The placeholder is dead code: it is either never called or called with
no effect, and the comment is stale. A reader looking for where the event-log actor
is wired up will be misdirected.
**Recommendation**
Either implement the actor registration here and have `AkkaHostedService` call it
(centralising the wiring), or delete `AddSiteEventLoggingActors` entirely and remove
the misleading comment.
**Resolution**
_Unresolved._

@@ -0,0 +1,564 @@
# Code Review — SiteRuntime
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.SiteRuntime` |
| Design doc | `docs/requirements/Component-SiteRuntime.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 16 |
## Summary
The SiteRuntime module is broadly well-structured: the actor hierarchy matches the
design doc, supervision strategies are explicit, and the trigger/alarm evaluation
logic is thorough. However, the review surfaced one genuinely serious correctness
defect — `Instance.SetAttribute` never routes writes to the Data Connection Layer
for data-sourced attributes, contradicting a core design decision and silently
turning device writes into local-only static overrides. Several other findings
cluster around two themes: (1) actor-thread discipline is violated in a few hot
paths (blocking `.GetAwaiter().GetResult()` calls on the actor thread, a fragile
fixed-delay reschedule for redeployment), and (2) the site-local repositories reach
into `SiteStorageService` private state via reflection and mint entity IDs with the
non-deterministic `string.GetHashCode()`. Script execution runs on the default
thread pool rather than a dedicated blocking dispatcher (the code acknowledges this
in a comment but ships it anyway). Test coverage exists for the coordinator actors,
persistence and scripting, but the short-lived execution actors, the replication
actor, and the repositories are untested.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | SetAttribute mis-routing, deploy double-count, redeploy reschedule race. |
| 2 | Akka.NET conventions | ✓ | Blocking on actor thread, script execution not on a dedicated dispatcher, premature success reply. |
| 3 | Concurrency & thread safety | ✓ | `_attributes` dictionary shared with child actors by reference; `_executionCounter` is actor-confined (OK). |
| 4 | Error handling & resilience | ✓ | Deploy reports Success before persistence; replicated artifact/S&F failures only logged (matches best-effort design). |
| 5 | Security | ✓ | Trust-model validation is substring-based and weak; reflection used to read private fields. |
| 6 | Performance & resource management | ✓ | Per-call SQLite connections (acceptable); CPU-bound scripts not interruptible by timeout. |
| 7 | Design-document adherence | ✓ | SetAttribute DCL routing missing; staggered-startup and supervision otherwise conform. |
| 8 | Code organization & conventions | ✓ | Repositories reflect into another class; synthetic IDs non-deterministic. |
| 9 | Testing coverage | ✓ | No tests for ScriptExecutionActor, AlarmExecutionActor, SiteReplicationActor, or the two repositories. |
| 10 | Documentation & comments | ✓ | Several XML comments describe behaviour the code does not implement (see findings). |
## Findings
### SiteRuntime-001 — `Instance.SetAttribute` never writes to the Data Connection Layer
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Scripts/ScriptRuntimeContext.cs:106`, `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:204` |
**Description**
The design doc (Component-SiteRuntime.md, "GetAttribute / SetAttribute" and
"Script Runtime API") states that `Instance.SetAttribute` on a *data-connected*
attribute must send a write request to the DCL, which writes to the physical
device, and that the in-memory value is **not** optimistically updated. For *static*
attributes it updates memory and persists an override.
The implementation makes no such distinction. `ScriptRuntimeContext.SetAttribute`
unconditionally sends a `SetStaticAttributeCommand`, and `InstanceActor.HandleSetStaticAttribute`
unconditionally treats every write as a static override: it mutates `_attributes`,
publishes an `AttributeValueChanged` with hard-coded `"Good"` quality, notifies
children, and persists a SQLite override. A script writing a data-sourced attribute
therefore never reaches the device, the write failure can never be returned
synchronously to the script, and the in-memory value diverges from the device
until the next subscription update overwrites it. The persisted override is also
wrong: data-sourced attributes should not have static overrides.
**Recommendation**
In `InstanceActor`, look up the target attribute in `_configuration.Attributes`. If
it has a non-empty `DataSourceReference`, issue a DCL write (e.g. a `WriteTagRequest`
to `_dclManager`) and surface success/failure to the caller; do not persist an
override and do not optimistically mutate `_attributes`. Only attributes with no
data source reference should follow the current static-override path. Consider
splitting the message into `SetStaticAttributeCommand` vs `SetDataAttributeCommand`,
or branching inside the handler.
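
The branching could be sketched as a pure routing decision, assuming an attribute is data-sourced exactly when its `DataSourceReference` is non-empty. `AttributeConfig`, `WriteRoute`, and `AttributeWriteRouter` are hypothetical simplifications, not the module's actual types.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for an entry in _configuration.Attributes.
public sealed record AttributeConfig(string Name, string? DataSourceReference);

public enum WriteRoute { StaticOverride, DclWrite, UnknownAttribute }

public static class AttributeWriteRouter
{
    // Decide where a SetAttribute call should go before mutating any state.
    public static WriteRoute Resolve(
        IReadOnlyDictionary<string, AttributeConfig> attributes, string name)
    {
        if (!attributes.TryGetValue(name, out var attr))
            return WriteRoute.UnknownAttribute;

        // Data-sourced attributes go to the DCL: no optimistic update, no override.
        return string.IsNullOrEmpty(attr.DataSourceReference)
            ? WriteRoute.StaticOverride
            : WriteRoute.DclWrite;
    }
}
```

Keeping the decision separate from the actor makes it testable without an actor system, and the `UnknownAttribute` case gives callers a failure to report instead of a silent success.
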
**Resolution**
_Unresolved._
### SiteRuntime-002 — `RouteInboundApiSetAttributes` always treats writes as static overrides
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:632` |
**Description**
`RouteInboundApiSetAttributes` (handling `Route.To().SetAttribute(s)` from the
Inbound API) emits a `SetStaticAttributeCommand` for every attribute, so it inherits
the same defect as SiteRuntime-001: writes to data-sourced attributes never reach
the device and are instead persisted as static overrides. In addition the response
is sent back as unconditionally successful (`true`) before the Instance Actor has
even processed the command, so a non-existent attribute or a future DCL write
failure is reported to the external caller as success.
**Recommendation**
Route through the same corrected `InstanceActor` write handler as SiteRuntime-001 so
the static-vs-data distinction is honoured. The optimistic ack is acceptable for
fire-and-forget static writes per the doc, but the XML comment should make the
limitation explicit, and once data-attribute writes are supported they need a real
response path.
**Resolution**
_Unresolved._
### SiteRuntime-003 — Redeployment relies on a fixed 500 ms reschedule and can collide on the child actor name
| | |
|--|--|
| Severity | High |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:222` |
**Description**
`HandleDeploy` stops an existing Instance Actor with `Context.Stop` and then
reschedules the same `DeployInstanceCommand` to itself after a hard-coded 500 ms,
hoping the child has fully terminated by then. `Context.Stop` is asynchronous; the
child is only removed from the parent's children collection after it actually stops
(including running `PostStop` on its descendants). If a deeply nested or slow
hierarchy takes longer than 500 ms, `CreateInstanceActor` calls `Context.ActorOf`
with a name that still belongs to the terminating child and throws
`InvalidActorNameException`. The `_instanceActors` dictionary check does not prevent
this — the dictionary entry is removed immediately, but the Akka child registry is
not. The 500 ms delay is also unconditionally added to every redeploy latency.
**Recommendation**
Watch the terminating child (`Context.Watch`) and recreate the Instance Actor only
after receiving the `Terminated` message, instead of guessing with a timer. Buffer
or stash the in-flight `DeployInstanceCommand` (and any further commands for that
instance) until termination completes.
**Resolution**
_Unresolved._
### SiteRuntime-004 — `_totalDeployedCount` is incremented on redeployment of an existing instance
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:239` |
**Description**
In `HandleDeploy`, the existing-actor branch (line 223) reschedules the command and
returns. When the rescheduled command runs, no actor exists, so the code falls
through to the "new instance" branch and executes `_totalDeployedCount++`
(line 239). A redeployment is an *update* of an already-deployed instance, not a new
one, so the deployed count is over-counted by one on every redeploy.
`StoreDeployedConfigAsync` uses UPSERT semantics, so the SQLite row count does not
grow, but the in-memory `_totalDeployedCount` (reported to the health collector via
`UpdateInstanceCounts`) drifts upward and the reported "disabled" count becomes
wrong.
**Recommendation**
Only increment `_totalDeployedCount` when the instance is genuinely new. Either
track whether this deploy replaced an existing config, or derive the deployed count
from storage / the union of running actors and disabled configs rather than
maintaining a hand-incremented counter.
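
One way to avoid the hand-incremented counter is to derive the count from the set of known instance names, so a redeploy is idempotent by construction. This is a sketch only; `DeployedInstanceTracker` and its members are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical replacement for a manually incremented deployed-instance counter.
public sealed class DeployedInstanceTracker
{
    private readonly HashSet<string> _deployed = new(StringComparer.Ordinal);

    // The count is derived from the set, so it cannot drift.
    public int TotalDeployed => _deployed.Count;

    // Returns true only when the instance is genuinely new.
    public bool RecordDeploy(string instanceName) => _deployed.Add(instanceName);

    public bool RecordDelete(string instanceName) => _deployed.Remove(instanceName);
}
```
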
**Resolution**
_Unresolved._
### SiteRuntime-005 — Deployment reports `Success` to central before persistence completes
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:272` |
**Description**
`HandleDeploy` replies to central with `DeploymentStatus.Success` immediately after
creating the Instance Actor, while the SQLite persistence (`StoreDeployedConfigAsync`
+ `ClearStaticOverridesAsync`) runs asynchronously on a `Task.Run`. If persistence
fails, `HandleDeployPersistenceResult` only logs an error — central has already been
told the deployment succeeded. On a subsequent node restart or failover the instance
will not be re-created (it is not in SQLite), so the deployment is silently lost
despite central recording success. This contradicts the design's intent that the
site is the durable source of truth for deployed configs.
**Recommendation**
Persist the config before replying, or treat a persistence failure as a deployment
failure and send a corrective `DeploymentStatusResponse`/health signal to central.
At minimum, do not report `Success` until the config row is committed.
**Resolution**
_Unresolved._
### SiteRuntime-006 — Site-local repositories read `SiteStorageService` private field via reflection
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs:183`, `src/ScadaLink.SiteRuntime/Repositories/SiteNotificationRepository.cs:181` |
**Description**
Both repositories' `CreateConnection()` use `Type.GetField("_connectionString",
BindingFlags.NonPublic | BindingFlags.Instance)` to extract the private connection
string out of `SiteStorageService`. This is brittle (any rename or refactor of the
field breaks it at runtime, not compile time), defeats encapsulation, and the
accompanying XML comment openly describes it as a "pragmatic" hack and is internally
contradictory (it states a connection string is "passed separately at DI
registration time" which is not what the code does). It also sits awkwardly against
the project's own script trust model, which forbids `System.Reflection` in scripts.
**Recommendation**
Expose the connection string properly: add an `ISiteStorageConnectionProvider`
(already referenced in `ServiceCollectionExtensions` XML docs but not used), or have
`SiteStorageService` expose a `CreateConnection()` factory, and inject that into the
repositories. Remove the reflection entirely.
**Resolution**
_Unresolved._
### SiteRuntime-007 — Synthetic entity IDs use the non-deterministic `string.GetHashCode()`
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Repositories/SiteExternalSystemRepository.cs:241`, `src/ScadaLink.SiteRuntime/Repositories/SiteNotificationRepository.cs:254` |
**Description**
`GenerateSyntheticId` computes `name.GetHashCode() & 0x7FFFFFFF`. On .NET Core,
`string.GetHashCode()` is randomized per process by default, so the "stable
deterministic synthetic ID" promised by the XML comment is not stable at all — it
changes every time the process restarts. Any caller that obtained an ID and later
calls `GetExternalSystemByIdAsync`/`GetNotificationListByIdAsync` after a restart
will fail to find the entity. It also risks collisions: distinct names can hash to
the same 31-bit value, and `GetExternalSystemByIdAsync` would then return the wrong
row.
**Recommendation**
Use a deterministic hash that does not vary between processes (e.g. FNV-1a or the
first bytes of a SHA-256 of the name) if a synthetic integer ID is genuinely
required. Any 31-bit ID can still collide, so lookups by ID should also verify the
name. Better, change the repository contract to key these site-local artifacts by
name rather than synthesising integer IDs.
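
A minimal sketch of a process-independent ID, using 32-bit FNV-1a over the UTF-8 bytes of the name and masking to 31 bits as the current code does. `StableId.FromName` is a hypothetical helper; FNV-1a is stable across restarts but is not collision-free.

```csharp
using System;
using System.Text;

public static class StableId
{
    // 32-bit FNV-1a: the same input yields the same value in every process.
    public static int FromName(string name)
    {
        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;

        uint hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(name))
        {
            hash ^= b;
            hash = unchecked(hash * prime);
        }

        // Mask the sign bit so the ID is a non-negative int.
        return (int)(hash & 0x7FFFFFFFu);
    }
}
```

Because distinct names can still map to the same 31-bit value, a lookup by ID should compare the stored name before returning a row.
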
**Resolution**
_Unresolved._
### SiteRuntime-008 — Blocking `.GetAwaiter().GetResult()` on the actor thread during startup
| | |
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:479` |
**Description**
`LoadSharedScriptsFromStorage` is called synchronously from
`HandleStartupConfigsLoaded` (the actor's message handler) and performs
`_storage.GetAllSharedScriptsAsync().GetAwaiter().GetResult()` followed by Roslyn
compilation of every shared script. This blocks the DeploymentManager singleton's
mailbox thread for the full duration of the SQLite read and all shared-script
compilation. On the default dispatcher this also ties up a thread-pool thread and
risks thread-pool starvation, and the singleton cannot process any other message
(deployments, lifecycle commands, debug routing) until it returns. The rest of the
class correctly uses `PipeTo`/`ContinueWith`.
**Recommendation**
Load shared scripts asynchronously and `PipeTo(Self)` an internal message, the same
pattern already used for `StartupConfigsLoaded`. Perform compilation either inside
the piped continuation handler (still on the actor thread but at least off the
synchronous startup path) or on a dedicated background task whose result is piped
back.
**Resolution**
_Unresolved._
### SiteRuntime-009 — Script execution actors run scripts on the default thread pool, not a dedicated dispatcher
| | |
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptExecutionActor.cs:72`, `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:289`, `src/ScadaLink.SiteRuntime/Actors/AlarmExecutionActor.cs:57` |
**Description**
The design (CLAUDE.md "Architecture & Runtime") states Script Execution Actors run
on a *dedicated blocking I/O dispatcher*. The code does not do this: `ScriptActor.SpawnExecution`
and `AlarmActor.SpawnAlarmExecution` create the execution actors with no
`.WithDispatcher(...)`, and the execution itself runs inside a bare `Task.Run`,
i.e. on the shared .NET thread pool. The `// NOTE: In production, configure a
dedicated ... dispatcher` comments acknowledge the gap but it ships unconfigured.
Scripts can perform synchronous blocking I/O (`Database.Connection`, synchronous
`ExternalSystem.Call`); running them on the shared pool can starve it and stall
unrelated Akka dispatchers and HTTP request handling under load.
**Recommendation**
Define the dedicated dispatcher in HOCON and chain `.WithDispatcher(...)` on the
execution actor `Props`. If the `Task.Run` model is kept, run script bodies on a
dedicated `TaskScheduler` / bounded scheduler rather than the global pool. Either
way, remove the "in production, configure…" comments by actually configuring it.
**Resolution**
_Unresolved._
### SiteRuntime-010 — `EnsureDclConnections` never updates a connection whose configuration changed
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:413` |
**Description**
`EnsureDclConnections` tracks created connections in `_createdConnections` and skips
any name already present (`if (_createdConnections.Contains(name)) continue;`). The
skip is purely name-based: if a redeployment (or an artifact deployment) changes the
endpoint, credentials, backup endpoint, or `FailoverRetryCount` of an existing
connection, the new configuration is silently ignored and the DCL keeps using the
stale `CreateConnectionCommand`. There is no `UpdateConnectionCommand` path. The
design states that after artifact deployment the site is fully self-contained with
current configuration; this caching breaks that for connection changes.
**Recommendation**
Compare the incoming connection config against the last one sent and re-issue a
create/update command when it differs, or have the DCL treat `CreateConnectionCommand`
as idempotent upsert and always forward it. Key the cache on a config hash, not just
the name.
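
The comparison could be sketched as a cache keyed by name that stores the last configuration sent, relying on C# record value equality. `ConnectionConfig` and `ConnectionConfigCache` are hypothetical and omit fields such as credentials.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical, simplified connection configuration; records compare by value.
public sealed record ConnectionConfig(
    string Endpoint, string? BackupEndpoint, int FailoverRetryCount);

public sealed class ConnectionConfigCache
{
    private readonly Dictionary<string, ConnectionConfig> _lastSent = new();

    // Returns true when a create/update command should be issued to the DCL.
    public bool ShouldSend(string name, ConnectionConfig incoming)
    {
        if (_lastSent.TryGetValue(name, out var previous) && previous == incoming)
            return false; // unchanged: skip, as the current code does

        _lastSent[name] = incoming; // new or changed: remember and forward
        return true;
    }
}
```
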
**Resolution**
_Unresolved._
### SiteRuntime-011 — Trust-model validation is a substring scan and is both over- and under-inclusive
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Scripts/ScriptCompilationService.cs:52` |
**Description**
`ValidateTrustModel` enforces the script trust model by doing raw `string.Contains` /
`IndexOf` on the script source text for forbidden namespace strings. This is
unreliable in both directions:
- **Bypass (under-inclusive):** the check looks only for the literal namespace
strings. A script can reach forbidden APIs without ever writing `System.IO` etc. —
e.g. via fully-qualified type use through aliasing, `global::`-prefixed names, or
simply because the namespace is already imported transitively. The compilation
references include `typeof(object).Assembly` (the whole of `System.Private.CoreLib`,
which contains `System.IO.File`, `System.Threading.Thread`, `System.Reflection`,
etc.), so forbidden types are fully resolvable at compile time and the only barrier
is this textual scan.
- **False positives (over-inclusive):** any occurrence of the substring in a comment,
string literal, or an unrelated identifier (e.g. a variable named `ProcessThreading`)
triggers a violation; the `AllowedExceptions` logic only rescues exact prefixes.
- The dead `isAllowed` variable at line 64 is computed and never used.
**Recommendation**
Enforce the trust model with a Roslyn `SyntaxWalker`/semantic analysis (inspect
resolved symbols and their containing namespaces/assemblies), or restrict the
compilation's metadata references and `AssemblyLoadContext` so forbidden types are
genuinely unavailable, rather than relying on source-text matching. Remove the
unused `isAllowed` variable.
**Resolution**
_Unresolved._
### SiteRuntime-012 — `AttributeAccessor`/`ScopeAccessors` block the script on a synchronous Ask
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Scripts/ScopeAccessors.cs:28` |
**Description**
`AttributeAccessor`'s indexer getter calls `_ctx.GetAttribute(...).GetAwaiter().GetResult()`,
synchronously blocking the script-execution thread on an actor Ask. Combined with
SiteRuntime-009 (scripts run on the shared thread pool) this means a script that
reads several attributes via `Attributes["X"]` holds a pool thread blocked for each
round-trip. The async variants (`GetAsync`/`SetAsync`) exist but the ergonomic
indexer encourages the blocking path. The XML comment notes "Reads block on the
actor Ask" but does not warn about the thread-pool impact.
**Recommendation**
Once a dedicated script dispatcher exists (SiteRuntime-009) the blocking is contained
to that pool, which is acceptable; until then, document the cost clearly and prefer
steering script authors to the async accessors. Consider making the indexer
internal-only and exposing only the async API.
**Resolution**
_Unresolved._
### SiteRuntime-013 — `HandleUnsubscribeDebugView` does nothing despite documented behaviour
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/InstanceActor.cs:414` |
**Description**
`HandleUnsubscribeDebugView` is documented ("Debug view unsubscribe — removes
subscription") and the actor registers a handler for `UnsubscribeDebugViewRequest`,
but the body only logs a debug message — there is no subscription state in the
Instance Actor to remove. The design places the actual subscription lifecycle in
`SiteStreamManager` (`Subscribe`/`Unsubscribe`/`RemoveSubscriber`), so the Instance
Actor genuinely has nothing to do here. The handler and its XML comment are
therefore misleading: a reader expects it to tear down a subscription.
**Recommendation**
Either remove the no-op handler and route `UnsubscribeDebugViewRequest` to wherever
the `SiteStreamManager` subscription is actually cancelled, or correct the XML
comment to state explicitly that subscription teardown is handled by
`SiteStreamManager` and this handler is a no-op acknowledgement.
**Resolution**
_Unresolved._
### SiteRuntime-014 — Trigger-expression evaluation blocks the coordinator actor thread
| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/ScriptActor.cs:219`, `src/ScadaLink.SiteRuntime/Actors/AlarmActor.cs:389` |
**Description**
`EvaluateExpressionTrigger` (ScriptActor) and `EvaluateExpression` (AlarmActor) run a
compiled Roslyn script with `.RunAsync(...).GetAwaiter().GetResult()` directly inside
the actor's `AttributeValueChanged` message handler. This blocks the coordinator
actor's mailbox thread for up to the 2-second timeout on every monitored attribute
change. Coordinator actors are on the default dispatcher and process the hot path of
attribute-change fan-out; a slow expression delays all other messages to that actor
and consumes a thread-pool thread for the duration. The inline comments correctly
note CPU-bound expressions are not interruptible but do not address the
mailbox-blocking concern.
**Recommendation**
Trigger expressions are expected to be cheap, but to keep the actor responsive
consider evaluating them off the actor thread (pipe the boolean result back as an
internal message) or pre-compiling to a plain delegate that executes near-instantly
without the Roslyn scripting `RunAsync` machinery.
**Resolution**
_Unresolved._
### SiteRuntime-015 — `LoggerFactory` created per Instance Actor and never disposed
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.SiteRuntime/Actors/DeploymentManagerActor.cs:746` |
**Description**
`CreateInstanceActor` does `var loggerFactory = new LoggerFactory();` for every
Instance Actor it creates, uses it once to produce an `ILogger<InstanceActor>`, and
never disposes it. `LoggerFactory` is `IDisposable`. With up to 500 instances (and
churn from redeployments) this leaks a factory per instance, and the produced
loggers are detached from the application's configured logging providers, so
Instance Actor logs may not be routed/filtered consistently with the rest of the
host.
**Recommendation**
Inject the application's `ILoggerFactory` (or an `ILogger<InstanceActor>` factory
delegate) into `DeploymentManagerActor` via DI and reuse it, rather than newing one
up per child. Do not create a fresh `LoggerFactory` in a hot creation path.
**Resolution**
_Unresolved._
### SiteRuntime-016 — Short-lived execution actors, replication actor, and repositories are untested
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.SiteRuntime.Tests/` |
**Description**
The test project covers the coordinator actors (`InstanceActor`, `ScriptActor`,
`AlarmActor`, `DeploymentManagerActor`), persistence, scripting and streaming, but a
search of the test sources finds no references to `ScriptExecutionActor`,
`AlarmExecutionActor`, `SiteReplicationActor`, `SiteExternalSystemRepository`, or
`SiteNotificationRepository`. These cover critical paths: script timeout/failure
handling and result reply, alarm on-trigger execution, peer config/S&F replication
(including the `SendToPeer` no-peer drop), and the reflection-based repository reads.
Several findings above (001/002 mis-routing, 007 ID instability, 011 trust bypass)
would likely have been caught by targeted tests.
**Recommendation**
Add unit/integration tests for the execution actors (success, timeout, exception,
Ask-reply, PoisonPill self-stop), `SiteReplicationActor` (outbound forward, inbound
apply, peer tracking on cluster events), and the two repositories (round-trip read,
synthetic-ID lookup, missing-row behaviour).
**Resolution**
_Unresolved._

@@ -0,0 +1,465 @@
# Code Review — StoreAndForward
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.StoreAndForward` |
| Design doc | `docs/requirements/Component-StoreAndForward.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 13 |
## Summary
The Store-and-Forward module is small and readable, with a clean SQLite persistence
layer, a sensible service API, and reasonable test coverage of the storage and service
happy paths. However the review surfaced two issues that undermine the module's core
purpose. First, the active delivery path never invokes the `ReplicationService`:
`ReplicateEnqueue/Remove/Park` have no callers anywhere in the codebase, so buffered
messages are not replicated to the standby node and the design's failover-durability
guarantee (Component doc "Persistence", CLAUDE.md "Store-and-Forward") is not met.
Second, there is an off-by-one in retry accounting: the immediate-failure path stores a
buffered message with `RetryCount = 1`, so a message configured with `MaxRetries = N`
is actually attempted `N` times in total rather than one immediate attempt plus `N`
retries, and a per-source `MaxRetries` of 1 produces zero retry attempts. Additional
themes: SQLite connection-per-call with no transactional grouping of multi-statement
operations, no concurrency guard against a parked message being retried while the
sweep is mid-flight, an unused enum member (`InFlight`) that drifts from the documented
status set, and untested critical paths (retry-due timing, replication-from-active,
the actor bridge). None of the findings are blockers for compilation, but the
replication and retry-count issues are functional defects against the design.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010). |
| 2 | Akka.NET conventions | ☑ | `ContinueWith` used instead of `PipeTo`-friendly continuations; default supervision; see 007. |
| 3 | Concurrency & thread safety | ☑ | Sweep guarded by `Interlocked`, but no guard against retry-vs-manage races (005); `OnActivity` event not thread-safe (009). |
| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002). |
| 5 | Security | ☑ | No issues found — parameterised SQL throughout; no secrets handled directly; payload JSON treated opaquely. |
| 6 | Performance & resource management | ☑ | New SQLite connection per call; multi-statement operations not wrapped in a transaction (006, 008). |
| 7 | Design-document adherence | ☑ | Replication gap (001); `InFlight` status undocumented/unused (011); "retrying" status from design doc not modelled. |
| 8 | Code organization & conventions | ☑ | `StoreAndForwardMessage` is an entity-like POCO living in the component, not Commons (012). |
| 9 | Testing coverage | ☑ | Retry-due timing, replication-from-active, and `ParkedMessageHandlerActor` are untested (013). |
| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004). |
## Findings
### StoreAndForward-001 — Replication to standby is never triggered by the active node
| | |
|--|--|
| Severity | Critical |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/ReplicationService.cs:40`, `:53`, `:66`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:155`, `:212`, `:222`, `:236` |
**Description**
`ReplicationService` exposes `ReplicateEnqueue`, `ReplicateRemove` and `ReplicatePark`
to forward buffer operations to the standby node, but a codebase-wide search shows these
methods have no callers. `StoreAndForwardService` — which performs every add (`EnqueueAsync`
line 155 / 163), remove (`RemoveMessageAsync` call at line 212) and park
(`UpdateMessageAsync` calls at lines 222/236) — holds no reference to `ReplicationService`
and never invokes it. Only the receiving half is wired (`SetReplicationHandler` and
`ApplyReplicatedOperationAsync` are used by `SiteReplicationActor`). The Component design
doc ("Persistence") and CLAUDE.md ("Store-and-Forward") require the active node to
forward every buffer operation to the standby so that, on failover, the new active node
"has a near-complete copy of the buffer." As written, the standby's S&F SQLite database
stays empty and a failover loses the entire buffer — a data-loss defect against a core
requirement.
**Recommendation**
Inject `ReplicationService` into `StoreAndForwardService` and call `ReplicateEnqueue`
after a successful `_storage.EnqueueAsync`, `ReplicateRemove` after `RemoveMessageAsync`,
and `ReplicatePark` after a park-causing `UpdateMessageAsync`. Update
`ServiceCollectionExtensions.AddStoreAndForward` to pass the dependency. Add a test that
asserts the replication handler observes each operation type.
**Resolution**
_Unresolved._
### StoreAndForward-002 — Messages enqueued with no registered handler are buffered but never deliverable
| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:162`, `:201` |
**Description**
`EnqueueAsync` falls through to "No handler registered — buffer for later" (line 162)
when no delivery handler is registered for the category. The retry sweep
(`RetryMessageAsync`, line 201) then logs "No delivery handler for category" and
`return`s without touching the message. No caller in the codebase ever calls
`RegisterDeliveryHandler` (the External System Gateway, Notification Service and
Database Gateway only call `EnqueueAsync`), so in the current wiring **every** buffered
message lands in this dead state: it is persisted, counts toward buffer depth, but can
never be retried, delivered or parked. It will sit Pending forever. Either the handler
registration is missing from Host/gateway startup, or the "buffer for later" path is a
silent trap. Either way the engine cannot deliver anything.
**Recommendation**
Decide the intended contract. If handlers are expected to be registered before
`EnqueueAsync` is reachable, make `EnqueueAsync` reject (or log an error) when no
handler exists rather than silently buffering an undeliverable message, and wire
`RegisterDeliveryHandler` calls in Host startup for all three categories. If late
registration is intended, the retry sweep should treat a still-missing handler as a
transient condition with bounded logging rather than a permanent no-op.
**Resolution**
_Unresolved._
### StoreAndForward-003 — Off-by-one in retry accounting: immediate failure pre-counts as retry 1
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:153`, `:229`, `:233` |
**Description**
On a transient immediate-delivery failure, `EnqueueAsync` buffers the message with
`message.RetryCount = 1` (line 153). The retry sweep then increments `RetryCount` before
the max check (`RetryCount++` at line 229; `RetryCount >= MaxRetries` at line 233).
Consequences: (1) a message configured with `MaxRetries = 1` is parked on the *first*
retry sweep without ever being retried, because after the immediate attempt `RetryCount`
is already 1 and the first sweep makes it 2 ≥ 1 — zero actual retries occur, contradicting
the design intent that the immediate attempt and the retry budget are distinct;
(2) the design doc's `Retry Count` field is "Number of attempts so far," but here it is
seeded to 1 before any *retry* has happened, making the parked-message `AttemptCount`
shown to operators off by one relative to configured `MaxRetries`. The
`EnqueueAsync_TransientFailure_BuffersForRetry` test even asserts `RetryCount == 1`,
locking in the ambiguity.
**Recommendation**
Choose one consistent meaning for `RetryCount` (recommended: total delivery attempts,
including the immediate one) and apply it uniformly. If `MaxRetries` is meant to bound
*retries* after the immediate attempt, buffer with `RetryCount = 0` and treat the
immediate failure as attempt 0; if it bounds *total attempts*, document that and adjust
the comparison. Update the affected test to match the chosen semantics.
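
One of the two options above (buffer with `RetryCount = 0`, so `MaxRetries` bounds retries *after* the immediate attempt) can be sketched as follows. This is a minimal illustration, not the module's code: `BufferedMessage`, `RetryAccounting`, and their members are hypothetical names, and the sketch assumes `MaxRetries >= 1`.

```csharp
// Hypothetical stand-in for the buffered message; only the counters matter here.
public sealed class BufferedMessage
{
    public int RetryCount { get; set; }   // retries performed after the immediate attempt
    public int MaxRetries { get; set; }   // assumed >= 1 in this sketch
    public bool Parked { get; set; }
}

public static class RetryAccounting
{
    // Called when the immediate delivery attempt fails transiently:
    // no retry has happened yet, so the counter starts at zero.
    public static void BufferAfterImmediateFailure(BufferedMessage m)
    {
        m.RetryCount = 0;
    }

    // Called by the sweep after one retry attempt fails transiently.
    public static void RecordFailedRetry(BufferedMessage m)
    {
        m.RetryCount++;
        if (m.RetryCount >= m.MaxRetries)
            m.Parked = true;
    }

    // Counts delivery attempts for a message whose delivery always fails.
    public static int TotalAttemptsUntilParked(int maxRetries)
    {
        var m = new BufferedMessage { MaxRetries = maxRetries };
        int attempts = 1;                 // the immediate attempt
        BufferAfterImmediateFailure(m);
        while (!m.Parked)
        {
            attempts++;                   // one retry performed by the sweep
            RecordFailedRetry(m);
        }
        return attempts;
    }
}
```

Under these assumed semantics a permanently failing message with `MaxRetries = N` is attempted `1 + N` times before it is parked, which keeps the immediate attempt and the retry budget distinct.
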
**Resolution**
_Unresolved._
### StoreAndForward-004 — `RegisterDeliveryHandler` XML doc contradicts the implemented contract
| | |
|--|--|
| Severity | Medium |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:38`, `:60` |
**Description**
The XML comment on the handler delegate (lines 37-40) says "Returns true on success,
throws on transient failure. Permanent failures should return false (message will NOT
be buffered)." That last clause is wrong for the retry path: in `RetryMessageAsync`,
a handler returning `false` does not "not buffer" — the message is already buffered, and
the code *parks* it immediately (lines 218-224). The comment describes only the
`EnqueueAsync` immediate path and misleads anyone implementing a handler about what
`false` means once a message is in the retry loop.
**Recommendation**
Reword the contract to cover both paths explicitly: `true` = delivered (remove from
buffer); `false` = permanent failure (not buffered on immediate attempt, parked on a
retry); exception = transient failure (buffer / increment retry).
**Resolution**
_Unresolved._
### StoreAndForward-005 — Parked-message retry/discard can race with the in-progress retry sweep
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:184`, `:266`, `:280` |
**Description**
`RetryPendingMessagesAsync` loads a snapshot of due messages (line 179) and then
processes them one by one (line 184), `await`-ing delivery for each. Meanwhile
`RetryParkedMessageAsync` / `DiscardParkedMessageAsync` (operator actions arriving via
`ParkedMessageHandlerActor`) run on unrelated threads and mutate the same rows. Because
each operation opens its own SQLite connection and there is no row-level coordination,
an operator can `DiscardParkedMessageAsync` a message that the sweep is concurrently
delivering: the sweep's later `RemoveMessageAsync`/`UpdateMessageAsync` operates on a
now-deleted row (harmless) — but if an operator `RetryParkedMessageAsync` resets a row
to Pending while the sweep simultaneously parks the same in-flight message, the operator
intent is silently overwritten. The `Interlocked` guard only prevents *overlapping
sweeps*, not sweep-vs-management races.
**Recommendation**
Funnel all message-state mutations through a single serialization point — e.g. perform
all S&F state changes inside the `ParkedMessageHandlerActor` (or a dedicated S&F actor)
so the actor mailbox serialises them, or make status transitions conditional in SQL
(e.g. `UPDATE ... WHERE id = @id AND status = @expected`) and re-check the affected
row count.
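
The conditional-transition idea can be illustrated without a database. The sketch below is an in-memory analogue only: `InMemoryStatusStore`, `Status`, and `TryTransition` are hypothetical names, and `ConcurrentDictionary.TryUpdate` stands in for `UPDATE ... WHERE id = @id AND status = @expected` with its affected-row count.

```csharp
using System.Collections.Concurrent;

public enum Status { Pending, Parked }

// Hypothetical in-memory stand-in for the message table (id -> status).
public sealed class InMemoryStatusStore
{
    private readonly ConcurrentDictionary<string, Status> _rows =
        new ConcurrentDictionary<string, Status>();

    public void Insert(string id, Status status)
    {
        _rows[id] = status;
    }

    // Analogue of: UPDATE ... SET status = @next WHERE id = @id AND status = @expected.
    // Returns the "affected row count": 1 if the row still had the expected status, else 0.
    public int TryTransition(string id, Status expected, Status next)
    {
        return _rows.TryUpdate(id, next, expected) ? 1 : 0;
    }

    public Status Get(string id)
    {
        return _rows[id];
    }
}
```

A caller that gets `0` back knows another writer changed the row first and can re-read the current state instead of overwriting it.
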
**Resolution**
_Unresolved._
### StoreAndForward-006 — `GetParkedMessagesAsync` count and page run without a transaction
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:166`, `:175` |
**Description**
`GetParkedMessagesAsync` issues a `COUNT(*)` and then a separate paged `SELECT` as two
commands on the same connection with no surrounding transaction. A concurrent
enqueue/park/discard between the two statements yields a `TotalCount` inconsistent with
the returned page (e.g. total reported as 51 while only 50 distinct parked rows now
exist, or a row visible in the page but excluded from the count). For a paginated UI
this produces flickering totals and occasional off-by-one page math.
**Recommendation**
Wrap both reads in a single transaction (`BeginTransaction`) so they see a consistent
snapshot, or accept the staleness and document it. A transaction is cheap here and
removes the inconsistency.
**Resolution**
_Unresolved._
### StoreAndForward-007 — Async work in `ParkedMessageHandlerActor` uses `ContinueWith` without scheduler/affinity guarantees
| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs:34`, `:68`, `:87` |
**Description**
The three handlers call a `Task`-returning service method and chain `.ContinueWith(...)
.PipeTo(sender)`. `Sender` is correctly captured into a local first, so the closure is
safe. However `ContinueWith` without an explicit `TaskScheduler` runs the continuation
on a thread-pool thread and the captured continuation builds the response objects there
— acceptable since it only touches locals, but it bypasses the idiomatic
`PipeTo`-with-success/failure-projection pattern and is fragile if someone later adds a
line touching actor state inside the continuation. There is also no `TaskContinuationOptions`,
so a faulted antecedent still runs the continuation (handled here via `IsCompletedSuccessfully`,
but only by convention).
**Recommendation**
Replace `ContinueWith(...).PipeTo(sender)` with `PipeTo(sender, success: result => ...,
failure: ex => ...)`, which is the documented Akka pattern, keeps response construction
off the actor thread safely, and makes the success/failure branches explicit.
**Resolution**
_Unresolved._
### StoreAndForward-008 — A SQLite connection is opened and torn down on every storage call
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:28`, `:61`, `:93`, `:117`, `:144`, `:162`, `:199`, `:221`, `:237`, `:267`, `:285`, `:305`, `:319` |
**Description**
Every method in `StoreAndForwardStorage` constructs a fresh `SqliteConnection` and calls
`OpenAsync`. Microsoft.Data.Sqlite pools connections, so this is not a correctness bug,
but a retry sweep over a large buffer performs one open per `UpdateMessageAsync`/
`RemoveMessageAsync` call inside the loop (`RetryMessageAsync`), multiplying connection
churn under load. With no max buffer size (by design) the buffer can grow large, so the
per-message connection acquisition is a measurable overhead on the hot retry path.
**Recommendation**
Consider a batched retry API that opens one connection (and one transaction) per sweep,
or pass an open connection into the per-message update calls. At minimum, document that
the design relies on the Sqlite connection pool for acceptable performance.
**Resolution**
_Unresolved._
### StoreAndForward-009 — `OnActivity` event invocation is not thread-safe against concurrent subscribe/unsubscribe
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:46`, `:309` |
**Description**
`OnActivity` is a public `event Action<...>` raised via `OnActivity?.Invoke(...)` in
`RaiseActivity` (line 309). `RaiseActivity` is called from both `EnqueueAsync` (caller
thread) and `RetryMessageAsync` (timer thread). The `?.Invoke` null-conditional captures
the delegate once so it will not NRE, but there is no synchronisation around the event
field itself; a subscriber added/removed concurrently with a raise has no defined
ordering. More importantly, subscriber callbacks run synchronously on the timer thread,
so a slow or throwing subscriber stalls or aborts the retry sweep (an exception in a
subscriber propagates out of `RaiseActivity` into `RetryMessageAsync`'s `try` and is
swallowed as a "transient failure," wrongly incrementing the message's retry count).
**Recommendation**
Snapshot the delegate (already done) and additionally wrap subscriber invocation in a
`try/catch` so a faulting logging subscriber cannot be misclassified as a delivery
failure. Document that handlers must be fast and non-throwing, or dispatch activity
notifications asynchronously.
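
A minimal sketch of isolating subscriber faults, assuming a helper that is not part of the reviewed code (`SafeEvents.RaiseIsolated` and its `onError` callback are hypothetical names):

```csharp
using System;

public static class SafeEvents
{
    // Invokes each subscriber separately so that one throwing subscriber
    // cannot abort the others or propagate into the caller.
    // Returns the number of subscribers that threw.
    public static int RaiseIsolated<T>(Action<T> handlers, T arg, Action<Exception> onError = null)
    {
        if (handlers == null)
            return 0;

        int failures = 0;
        foreach (Action<T> handler in handlers.GetInvocationList())
        {
            try
            {
                handler(arg);
            }
            catch (Exception ex)
            {
                failures++;
                if (onError != null)
                    onError(ex);   // e.g. log the fault; never rethrow here
            }
        }
        return failures;
    }
}
```

With a helper of this shape, an exception raised by an activity subscriber stays out of the delivery `try` block, so it cannot be counted as a transient delivery failure.
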
**Resolution**
_Unresolved._
### StoreAndForward-010 — Retry of a parked message does not reset `LastAttemptAt`, so its retry timing is unspecified
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:203`, `:101` |
**Description**
`RetryParkedMessageAsync` sets `status = Pending, retry_count = 0, last_error = NULL`
but leaves `last_attempt_at` unchanged (lines 203-206). The retry-due query
(`GetMessagesForRetryAsync`, lines 101-105) selects Pending rows where
`last_attempt_at IS NULL OR ... elapsed >= retry_interval_ms`. A message parked after
exhausting retries has an old `last_attempt_at`; once re-queued, the elapsed time since
that stale timestamp is almost certainly already greater than the retry interval, so the
operator-retried message is attempted on the very next sweep regardless of the
configured interval. That is probably the desired behaviour (operator wants it tried
now), but it is unspecified and fragile: the immediate retry happens by accident rather
than by design, and with a very large `retry_interval_ms` the stale timestamp might not
yet be older than the interval, so the re-queued message would wait instead.
**Recommendation**
Explicitly decide and encode the intent: either set `last_attempt_at = NULL` on
re-queue so the message is unambiguously due now, or set it to "now" so it waits one
interval. Document the chosen behaviour in the method's XML comment.
**Resolution**
_Unresolved._
### StoreAndForward-011 — `StoreAndForwardMessageStatus.InFlight` is unused and the doc's "retrying" status is unmodelled
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Commons/Types/Enums/StoreAndForwardMessageStatus.cs:9`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:219`, `:235` |
**Description**
The enum defines `Pending, InFlight, Parked, Delivered`. The module only ever uses
`Pending` and `Parked`; `InFlight` and `Delivered` are never assigned (delivered
messages are deleted, not marked `Delivered`). Meanwhile the Component design doc
("Message Format" -> Status) specifies the set "Pending, retrying, or parked." So the
code's enum drifts from the doc in two directions: it carries dead members the doc does
not mention (`InFlight`, `Delivered`) and omits the doc's `retrying` state. A message
mid-retry is indistinguishable from one that has never been attempted.
**Recommendation**
Reconcile the enum with the design. Either drop the unused members and update the doc,
or implement the documented `retrying` state and use `InFlight` to mark a message the
sweep is actively delivering (which would also help with finding 005).
**Resolution**
_Unresolved._
### StoreAndForward-012 — `StoreAndForwardMessage` is a persistence entity but lives in the component, not Commons
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardMessage.cs:9` |
**Description**
`StoreAndForwardMessage` is a persistence-ignorant POCO that maps directly to the
`sf_messages` table and is also carried across the network inside `ReplicationOperation`
(replicated to the standby node over Akka remoting). CLAUDE.md "Code Organization" states
that entity classes are persistence-ignorant POCOs in Commons and that message contracts
follow additive-only evolution. Because this type doubles as a replication wire contract
but lives in the component assembly, it is not co-located with the other Commons
entities and its evolution is not governed by the additive-only message-contract rule.
This is a borderline case (the type is site-local), but the cross-node use via
`ReplicationOperation` makes it a de-facto message contract.
**Recommendation**
Either move `StoreAndForwardMessage` (and `ReplicationOperation`) into the Commons
`Entities`/`Messages` hierarchy so they are governed by the contract-evolution rules, or
introduce a separate DTO for replication and keep `StoreAndForwardMessage` purely as the
local persistence model. Document the decision.
**Resolution**
_Unresolved._
### StoreAndForward-013 — Critical paths lack test coverage: retry-due timing, replication-from-active, and the actor bridge
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.StoreAndForward.Tests/` (whole directory); `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:101`; `src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs` |
**Description**
The existing tests cover storage CRUD and the service happy/failure paths well, but
three important behaviours are untested: (1) the retry-due time filter in
`GetMessagesForRetryAsync` — every service test sets `DefaultRetryInterval = TimeSpan.Zero`,
so the `julianday` elapsed-time comparison (the most error-prone SQL in the module) is
never exercised with a non-zero interval; a message that is *not yet due* should be
skipped, and that is never verified. (2) Replication from the active side — no test
asserts that an enqueue/remove/park causes a `Replicate*` call (this is exactly the gap
behind finding 001; a test would have caught it). (3) `ParkedMessageHandlerActor` has no
test at all — the Query/Retry/Discard request-to-response mapping and the
`ExtractMethodName` JSON parsing are unverified, including the malformed-JSON branch.
**Recommendation**
Add tests for: a non-zero retry interval where a recently-attempted message is excluded
and an older one is included; active-side replication invocation per operation type
(once finding 001 is fixed); and `ParkedMessageHandlerActor` using `Akka.TestKit`,
including `ExtractMethodName` for `MethodName`, `Subject`, missing-property and
invalid-JSON payloads.
**Resolution**
_Unresolved._

@@ -0,0 +1,487 @@
# Code Review — TemplateEngine
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.TemplateEngine` |
| Design doc | `docs/requirements/Component-TemplateEngine.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 14 |
## Summary
The Template Engine is a pure central-side modeling library: stateless services
over `ITemplateEngineRepository` plus four static helper classes (collision, cycle,
lock, resolver). It has no Akka actors and no direct concurrency, so the Akka and
thread-safety categories produce nothing of substance. The code is generally
well-structured and the cascade-based composition model (derived templates owned by
composition slots) is consistently applied. However the review surfaced several real
correctness gaps. The most serious are in **flattening**: composed alarms and scripts
nested below the first level are silently dropped, derived templates omit base
alarms entirely (breaking per-slot alarm override), and the alarm-on-trigger-script
resolution step is an empty placeholder so that whole validation rule is dead.
Validation has two security-relevant weaknesses — the forbidden-API scan is a naive
substring match and the brace-balance "compile" check mispredicts on verbatim /
interpolated / raw string literals. Several documented behaviours (collision check on
create, optimistic concurrency on instance state) are claimed but not implemented.
Themes: validation that is weaker than the design promises, and asymmetric handling
of attributes vs. alarms vs. scripts throughout the resolve/flatten/derive paths.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ✓ | Multiple real bugs: deep composed-member loss, derived alarms omitted, granularity bypass, no-op create-time collision block. |
| 2 | Akka.NET conventions | ✓ | No actors in this module (`AddTemplateEngineActors` is an empty placeholder). Nothing to assess. |
| 3 | Concurrency & thread safety | ✓ | Services are stateless, scoped per request; static helpers hold no mutable state. Design says template editing is last-write-wins; that is honoured. See TemplateEngine-010 re: a doc claim of optimistic concurrency that is not implemented. |
| 4 | Error handling & resilience | ✓ | `Result<T>` used consistently; repository nulls guarded. `FlatteningService` wraps in try/catch. No store-and-forward or failover surface in this module. |
| 5 | Security | ✓ | No auth checks in-module (delegated to callers per design). Script trust-model enforcement is weak — see TemplateEngine-006 and TemplateEngine-007. |
| 6 | Performance & resource management | ✓ | `GetAllTemplatesAsync` reloaded on most member edits; one genuine N+1 in `TemplateDeletionService` (TemplateEngine-009). No `IDisposable` leaks (`JsonDocument`/streams disposed). |
| 7 | Design-document adherence | ✓ | Drift found: recursive composition not fully implemented in flattening; `DataType` enum naming differs from doc; optimistic-concurrency claim. |
| 8 | Code organization & conventions | ✓ | POCO entities in Commons, repo interfaces in Commons, Options pattern N/A (no options here). Duplicate deletion logic (TemplateEngine-014). |
| 9 | Testing coverage | ✓ | Tests exist for every file, but the dead/placeholder paths (TemplateEngine-004, 005) and deep nesting (TemplateEngine-001) are not exercised. |
| 10 | Documentation & comments | ✓ | Mostly accurate; a misleading converter comment (TemplateEngine-011) and a stale enum/doc mismatch (TemplateEngine-012). |
## Findings
### TemplateEngine-001 — Deeply nested composed members are dropped during flattening
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/Flattening/FlatteningService.cs:211`, `src/ScadaLink.TemplateEngine/Flattening/FlatteningService.cs:535`, `src/ScadaLink.TemplateEngine/Flattening/FlatteningService.cs:609` |
**Description**
The design doc states composition supports "recursive nesting of feature modules"
and that nested paths extend as `[Outer].[Inner].[Member]`. `ResolveComposedAttributes`
only descends **one** level of nesting: it resolves the directly-composed module, then
its immediate child compositions, and stops. A module composed three or more levels
deep contributes no attributes to the flattened configuration. `ResolveComposedAlarms`
and `ResolveComposedScripts` are worse — they handle only the first (direct) level and
do not descend at all, so any alarm or script in a nested composed module is dropped
entirely. `CollisionDetector` and `TemplateResolver` recurse fully, so collision
detection and the authoring UI will show members that the deployed configuration
silently lacks.
**Recommendation**
Replace the hand-unrolled one/two-level loops with a single recursive walk
(carrying the accumulated path prefix) for attributes, alarms, and scripts, matching
the recursion already in `TemplateResolver.AddComposedMembers` and
`CollisionDetector.CollectComposedMembers`.
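
The shape of such a recursive walk can be sketched as follows. The `Module` type and `ComposedWalk` helper are hypothetical simplifications (plain member names, a dictionary of composition slots), not the engine's entities, and the sketch assumes the composition graph is acyclic, as the module's cycle detection is meant to guarantee.

```csharp
using System.Collections.Generic;

// Hypothetical minimal model: a module has its own members and named composition slots.
public sealed class Module
{
    public List<string> Members { get; } = new List<string>();
    public Dictionary<string, Module> Slots { get; } = new Dictionary<string, Module>();
}

public static class ComposedWalk
{
    // Returns every member at every nesting depth, as a dot-separated path.
    public static List<string> Flatten(Module root)
    {
        var result = new List<string>();
        Walk(root, "", result);
        return result;
    }

    private static void Walk(Module module, string prefix, List<string> result)
    {
        foreach (var member in module.Members)
            result.Add(prefix + member);

        // Recurse into every slot, extending the accumulated path prefix.
        foreach (var slot in module.Slots)
            Walk(slot.Value, prefix + slot.Key + ".", result);
    }
}
```

The same walk could be reused for attributes, alarms, and scripts by parameterising which member collection is read, so the three resolve paths cannot drift apart in depth handling.
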
**Resolution**
_Unresolved._
### TemplateEngine-002 — Derived templates omit all base alarms; composed alarms cannot be overridden per slot
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:799` |
**Description**
`BuildDerivedTemplate` copies the base template's `Attributes` and `Scripts` into the
new derived template as `IsInherited = true` placeholder rows so they can be overridden
per composition slot, but there is **no loop for `Alarms`**. The derived template
therefore has zero alarm rows. The `TemplateAlarm` entity also has no `IsInherited` or
`LockedInDerived` fields (unlike `TemplateAttribute` / `TemplateScript`), so even if a
copy loop were added there is no mechanism to mark a copied alarm as inherited or to
override one. The design's Override Granularity section explicitly requires composed
alarm fields (Priority, Trigger thresholds, Description, On-Trigger Script) to be
overridable. As written, a composed module's alarms cannot be tuned for the slot they
are used in.
**Recommendation**
Add an alarm copy loop to `BuildDerivedTemplate` and add `IsInherited` /
`LockedInDerived` fields to `TemplateAlarm`, mirroring `TemplateAttribute`. Update
`UpdateAlarmAsync` to honour them as `UpdateAttributeAsync` / `UpdateScriptAsync`
already do.
**Resolution**
_Unresolved._
### TemplateEngine-003 — `UpdateAttributeAsync` lets a non-locked attribute change its fixed DataType / DataSourceReference
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:285` |
**Description**
`LockEnforcer.ValidateAttributeOverride` correctly rejects a change to `DataType` or
`DataSourceReference` (both "fixed by the defining level" per the design). But the
caller only honours that error when the attribute is already locked:
```csharp
var granularityError = LockEnforcer.ValidateAttributeOverride(existing, proposed);
if (granularityError != null && existing.IsLocked)
return Result<TemplateAttribute>.Failure(granularityError);
```
Lines 293-294 then unconditionally apply `existing.DataType = proposed.DataType` and
`existing.DataSourceReference = proposed.DataSourceReference`. For the common case of an
unlocked attribute, the fixed-field guard is dead and both fields are silently mutable,
violating the override-granularity rule. (The lock-error branch of the same helper is
also redundant — a locked attribute already returns earlier inside the helper.)
**Recommendation**
Remove the `&& existing.IsLocked` condition so the granularity error is always
returned, and stop assigning `DataType` / `DataSourceReference` from `proposed` in the
apply block.
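
A simplified sketch of the intended guard, using hypothetical stand-ins (`AttributeRow`, `AttributeOverride.TryApply`) rather than the real `TemplateAttribute`, `LockEnforcer`, and `Result<T>` types:

```csharp
// Hypothetical, simplified stand-in for the attribute entity.
public sealed class AttributeRow
{
    public string Value { get; set; }
    public string DataType { get; set; }
    public string DataSourceReference { get; set; }
    public bool IsLocked { get; set; }
}

public static class AttributeOverride
{
    // Returns null on success, otherwise an error message.
    // 'existing' is mutated only when the override is accepted.
    public static string TryApply(AttributeRow existing, AttributeRow proposed)
    {
        if (existing.IsLocked)
            return "Attribute is locked.";

        // The fixed-field check applies whether or not the attribute is locked.
        if (existing.DataType != proposed.DataType ||
            existing.DataSourceReference != proposed.DataSourceReference)
            return "DataType and DataSourceReference are fixed by the defining level.";

        existing.Value = proposed.Value;   // only overridable fields are copied
        return null;
    }
}
```

The point of the sketch is that the fixed fields are never assigned from `proposed`, so an accepted override cannot change them even by mistake.
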
**Resolution**
_Unresolved._
### TemplateEngine-004 — Alarm on-trigger script references are never resolved (empty placeholder)
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/Flattening/FlatteningService.cs:695` |
**Description**
`ResolveAlarmScriptReferences` is invoked as Step 7 of `Flatten` but its body is empty
— only a comment describing what it should do. Consequently every
`ResolvedAlarm.OnTriggerScriptCanonicalName` stays `null`. This has two downstream
effects: (1) `SemanticValidator`'s "on-trigger script must exist" check
(`SemanticValidator.cs:209`) can never fire, so the design-mandated validation of
alarm on-trigger script references is silently absent; (2) `RevisionHashService` and
`DiffService` both hash/compare `OnTriggerScriptCanonicalName`, so a change to which
script an alarm triggers never affects the revision hash and is invisible to the diff
— a real staleness-detection gap.
**Recommendation**
Implement the resolution: map each alarm's `OnTriggerScriptId` (set on `TemplateAlarm`)
to the canonical name of the corresponding resolved script, accounting for composition
prefixes. If the design intends scripts to be referenced by name within scope, document
and implement that consistently.
**Resolution**
_Unresolved._
### TemplateEngine-005 — Collision validation is skipped when creating a child template
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:56` |
**Description**
`CreateTemplateAsync` contains a block guarded by `if (parentTemplateId.HasValue)` that
loads `GetAllTemplatesAsync` and then does nothing but hold a comment — it never runs a
collision check. A child template created with a parent inherits the parent's members;
if the child is later given members (via `AddAttributeAsync` etc.) those calls do run
`CollisionDetector`, but the create path itself performs no naming-collision validation
and `UpdateTemplateAsync` only validates collisions on a name change. The design states
naming collisions are design-time errors that must block a save. The dead block is also
confusing and allocates an unused full-table read.
**Recommendation**
Either run a real collision check on the to-be-created template (including its
inherited members) or delete the dead block and its unused query. If create-time
collisions are genuinely impossible because a fresh template has no members, document
that explicitly instead of leaving a no-op.
**Resolution**
_Unresolved._
### TemplateEngine-006 — Forbidden-API enforcement is a naive substring scan (bypassable and false-positive prone)
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/Validation/ScriptCompiler.cs:21`, `src/ScadaLink.TemplateEngine/Validation/ValidationService.cs:318` |
**Description**
`ScriptCompiler.ForbiddenPatterns` is checked with `code.Contains(pattern)`. This is
both under- and over-inclusive against the script trust model:
- **Bypass:** `using System.IO;` followed by `File.ReadAllText(...)` contains no
`System.IO.` token; `using static System.IO.File;`, namespace aliases, and
`global::System.IO.File` all evade the literal patterns.
- **False positive:** a string literal, comment, or attribute name containing the text
`System.IO.` is flagged as a forbidden API even though it is inert.
The same patterns are reused for trigger-expression validation
(`CheckExpressionSyntax`), inheriting the same weakness. The file comment acknowledges
this is interim until Roslyn is wired in, but the trust model is security-relevant and
the gap should be tracked.
**Recommendation**
Defer real enforcement to the Roslyn-based compiler (semantic symbol analysis of
referenced types/namespaces) rather than text matching. Until then, document the
limitation prominently and treat the substring scan as advisory, not authoritative.
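
To make the two failure modes above concrete, here is a minimal sketch of a substring scan. `SubstringScanDemo` and its single `"System.IO."` pattern are assumptions for illustration; the real `ScriptCompiler.ForbiddenPatterns` list is not reproduced in this review.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for the check in ScriptCompiler; not the real implementation.
public static class SubstringScanDemo
{
    // Assumed pattern, taken from the `System.IO.` token named in this finding.
    private static readonly string[] ForbiddenPatterns = { "System.IO." };

    // Same shape as the reviewed check: a plain ordinal substring test per pattern.
    public static bool IsFlagged(string code) =>
        ForbiddenPatterns.Any(pattern => code.Contains(pattern));

    // Performs file I/O, yet never contains the literal "System.IO." token
    // ("System.IO" is followed by ';', not '.').
    public const string Bypass =
        "using System.IO;\nvar text = File.ReadAllText(\"secrets.txt\");";

    // Inert: the token appears only inside a comment.
    public const string FalsePositive =
        "// This script must not use System.IO.File\nvar x = 1;";
}
```

Under these assumptions `IsFlagged(Bypass)` is `false` and `IsFlagged(FalsePositive)` is `true`, which is the opposite of what the trust model needs in both cases.
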
**Resolution**
_Unresolved._
### TemplateEngine-007 — Brace-balance "compilation" misjudges verbatim / interpolated / raw strings
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/Validation/ScriptCompiler.cs:54`, `src/ScadaLink.TemplateEngine/SharedScriptService.cs:124` |
**Description**
`ScriptCompiler.TryCompile` tracks string state with a single `inString` flag toggled
on `"` and an escaped-quote check of `code[i-1] != '\\'`. It does not understand
verbatim strings (`@"..."` where `""` is the escape and `\` is literal), interpolated
strings (`$"{...}"` whose braces are code, not text), raw string literals (`"""..."""`),
or char literals. A script with a verbatim string containing a brace, an interpolated
string, or a `'}'` char literal can be wrongly rejected as having mismatched braces,
which blocks a valid script from deployment. `SharedScriptService.ValidateSyntax` is even
cruder: it counts braces/brackets/parens with no string or comment awareness at all, so
any string literal containing one of those characters produces a false syntax error.
**Recommendation**
Once the Roslyn compiler is available, parse with `CSharpSyntaxTree.ParseText` and
inspect diagnostics instead of hand-rolling a tokenizer. If an interim check must
remain, at minimum handle verbatim/interpolated/char literals or scope the check down
to something that cannot produce false positives.
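
A rough sketch of the parser-based check, assuming the `Microsoft.CodeAnalysis.CSharp` package is referenced. `RoslynSyntaxCheck` and `GetSyntaxErrors` are hypothetical names, and parsing in script mode is an assumption about how ScadaLink scripts are shaped.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

// Hypothetical helper; requires the Microsoft.CodeAnalysis.CSharp NuGet package.
public static class RoslynSyntaxCheck
{
    public static IReadOnlyList<string> GetSyntaxErrors(string code)
    {
        // Assumption: scripts are statement-level snippets, so parse in Script mode.
        var options = CSharpParseOptions.Default.WithKind(SourceCodeKind.Script);
        SyntaxTree tree = CSharpSyntaxTree.ParseText(code, options);

        // Syntax-only diagnostics; no compilation or semantic binding happens here.
        return tree.GetDiagnostics()
            .Where(d => d.Severity == DiagnosticSeverity.Error)
            .Select(d => d.ToString())
            .ToList();
    }
}
```

Because the real lexer handles verbatim, interpolated, raw, and char literals, a script such as `var c = '}';` should produce no errors, while a genuinely unclosed block still does. This covers syntax only; forbidden-API enforcement would need semantic analysis on top.
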
**Resolution**
_Unresolved._
### TemplateEngine-008 — `SetAlarmOverrideAsync` accepts overrides for unknown / composed alarms with no validation
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/Services/InstanceService.cs:178` |
**Description**
`SetAlarmOverrideAsync` looks up the alarm by name among the template's **direct**
alarms only. When the lookup returns `null` — which is the case for every composed
(path-qualified) alarm as well as for a genuinely non-existent name — the method skips
the lock check and proceeds to persist the override. This means: (1) an override can be
created for an alarm that does not exist (a silent dead record), and (2) a composed
alarm that is `IsLocked` at the template level can be overridden, bypassing the lock
rule. `SetAttributeOverrideAsync` by contrast rejects unknown attribute names. The
inline comment acknowledges the gap but the behaviour is inconsistent and risky.
**Recommendation**
Resolve the full effective alarm set (via the resolver / flattening) so composed
alarms are found, reject overrides whose canonical name is not in that set, and apply
the lock check to composed alarms too.
**Resolution**
_Unresolved._
### TemplateEngine-009 — N+1 query in `TemplateDeletionService.CanDeleteTemplateAsync`
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/Services/TemplateDeletionService.cs:75` |
**Description**
Check 3 ("other templates compose it directly") loads all templates and then issues a
separate `GetCompositionsByTemplateIdAsync` call **inside a loop over every template**
— one round-trip per template in the database. The composition information needed is
already reachable via `t.Compositions` on the templates returned by
`GetAllTemplatesAsync` (which `TemplateService.DeleteTemplateAsync` uses for the
equivalent check at line 162). The loop scales linearly with the template count on
every delete-precheck and every actual delete.
**Recommendation**
Use the `Compositions` navigation already loaded by `GetAllTemplatesAsync`, or add a
single repository call that returns all compositions, rather than querying per
template.
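
One possible shape for the in-memory check, as a sketch only. The `Template` and `Composition` records below are simplified, hypothetical stand-ins for the real entities, and `ComposedTemplateId` is an assumed property name.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified shapes; the real entities are not shown in this review.
public sealed record Composition(int ComposedTemplateId);
public sealed record Template(int Id, string Name, IReadOnlyList<Composition> Compositions);

public static class ComposingTemplateCheck
{
    // Returns the names of templates that directly compose targetId, using only
    // the already-loaded Compositions collections (no per-template query).
    public static List<string> FindComposers(IEnumerable<Template> allTemplates, int targetId) =>
        allTemplates
            .Where(t => t.Compositions.Any(c => c.ComposedTemplateId == targetId))
            .Select(t => t.Name)
            .ToList();
}
```

This replaces one round-trip per template with a single pass over data that `GetAllTemplatesAsync` has already returned, provided the navigation is in fact loaded by that call.
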
**Resolution**
_Unresolved._
### TemplateEngine-010 — `InstanceService` documents optimistic concurrency that is not implemented
| | |
|--|--|
| Severity | Medium |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/Services/InstanceService.cs:9` |
**Description**
The class summary states instances support "Enabled/disabled state with optimistic
concurrency". `EnableAsync`, `DisableAsync`, `AssignToAreaAsync` and the override/binding
mutators all perform a plain read-modify-write with no version token, `RowVersion`, or
concurrency check. Two concurrent enable/disable requests resolve as last-writer-wins
with no conflict detection. Either the doc is stale (the design's optimistic-concurrency decision
applies to *deployment status records*, not instance state) or a concurrency token was
intended and is missing.
**Recommendation**
If last-write-wins is acceptable for instance state, correct the XML doc. If optimistic
concurrency is required, add a concurrency token to `Instance` and surface a conflict
result.
**Resolution**
_Unresolved._
### TemplateEngine-011 — `SortedPropertiesConverterFactory` is dead code with a misleading comment
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/Flattening/RevisionHashService.cs:136` |
**Description**
`SortedPropertiesConverterFactory.CanConvert` always returns `false` and
`CreateConverter` always returns `null`, so the factory registered in
`CanonicalJsonOptions` does nothing. The class comment claims it "ensures properties are
serialized in alphabetical order for deterministic output", and the options comment says
"Ensure consistent ordering" — both are false. Determinism actually relies entirely on
the `Hashable*` records being hand-declared with alphabetically-ordered properties (plus
camelCase). That works today but is fragile: a future contributor adding a property out
of alphabetical order silently changes every revision hash, and the dead converter gives
false confidence that ordering is enforced programmatically.
**Recommendation**
Either implement the converter to genuinely sort properties, or delete it and replace
the comments with an explicit note that determinism depends on the manual property
ordering of the `Hashable*` records (ideally enforced by a test).
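
If the manual-ordering route is chosen, a test along these lines could enforce it. This is a sketch: `HashableExample` is a hypothetical stand-in for one of the `Hashable*` records, and the camelCase options are assumed to mirror `CanonicalJsonOptions`.

```csharp
using System;
using System.Linq;
using System.Text.Json;

// Hypothetical stand-in for one of the Hashable* records.
public sealed record HashableExample(string Alpha, int Beta, bool Gamma);

public static class PropertyOrderCheck
{
    // Assumed to mirror the camelCase setting of CanonicalJsonOptions.
    private static readonly JsonSerializerOptions Options = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase
    };

    // True when the serialized top-level property names are in ascending ordinal order.
    public static bool IsSerializedInOrdinalOrder<T>(T value)
    {
        string json = JsonSerializer.Serialize(value, Options);
        using JsonDocument doc = JsonDocument.Parse(json);
        string[] names = doc.RootElement.EnumerateObject().Select(p => p.Name).ToArray();
        return names.SequenceEqual(names.OrderBy(n => n, StringComparer.Ordinal));
    }
}
```

The check inspects the serialized output rather than reflection metadata, so it verifies what actually feeds the hash. It only looks at top-level properties; nested records would need the same check applied to each nested type.
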
**Resolution**
_Unresolved._
### TemplateEngine-012 — `DataType` enum naming diverges from the design doc
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/Validation/SemanticValidator.cs:18` |
**Description**
The design doc (Attribute section) lists data types as "Boolean, Integer, Float,
String". The actual `DataType` enum is `Boolean, Int32, Float, Double, DateTime,
Binary`. `SemanticValidator.NumericDataTypes` correctly hard-codes the real names
(`Int32`, `Float`, `Double`), so the code is internally consistent, but the design doc
is stale — it omits `Double`, `DateTime`, `Binary` and calls the integer type
"Integer". This makes the doc an unreliable reference for which trigger-operand types
are numeric.
**Recommendation**
Update `docs/requirements/Component-TemplateEngine.md` to list the actual enum members,
or rename the enum to match the doc if "Integer" is the intended canonical name.
**Resolution**
_Unresolved._
### TemplateEngine-013 — `ToDictionary(t => t.Id)` throws on duplicate IDs; cycle detectors overload Id 0 as a sentinel
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/CycleDetector.cs:30`, `src/ScadaLink.TemplateEngine/CycleDetector.cs:38` |
**Description**
Across the static helpers, `allTemplates.ToDictionary(t => t.Id)` is used freely; if the
caller ever passes a list containing two templates with the same `Id` (e.g. a
not-yet-saved template assigned `Id == 0`, or duplicated input) the call throws an
unhandled `ArgumentException` rather than returning a `Result` failure. Separately,
`CycleDetector` uses `0` as the "no parent" sentinel (`currentId != 0`,
`ParentTemplateId ?? 0`) and `DetectInheritanceCycle` / `DetectCrossGraphCycle` ignore a
proposed parent/composed id of `0`. EF identity keys start at 1 so this is currently
benign, but the overload is fragile — an in-memory or test template with `Id == 0`
would be treated as "no template" and cycle checks would be silently skipped.
**Recommendation**
Guard the dictionary builds (or use a grouping/`ToLookup`) and validate input, and use
`int?`/`-1` rather than `0` as the no-parent sentinel so a real id of 0 is never
special.
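
A minimal sketch of both guards, under stated assumptions: `TemplateNode` is a hypothetical simplification of the template entity, and the `bool`/`out` pattern stands in for whatever `Result` type the module uses.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified shape; the real Template entity is not shown in this review.
public sealed record TemplateNode(int Id, int? ParentTemplateId);

public static class SafeTemplateIndex
{
    // Builds an Id -> template map, reporting duplicate ids instead of letting
    // ToDictionary throw ArgumentException.
    public static bool TryBuild(
        IEnumerable<TemplateNode> templates,
        out Dictionary<int, TemplateNode> byId,
        out string? error)
    {
        var list = templates.ToList();
        var duplicateIds = list.GroupBy(t => t.Id)
            .Where(g => g.Count() > 1)
            .Select(g => g.Key)
            .ToList();

        if (duplicateIds.Count > 0)
        {
            byId = new Dictionary<int, TemplateNode>();
            error = $"Duplicate template ids: {string.Join(", ", duplicateIds)}";
            return false;
        }

        byId = list.ToDictionary(t => t.Id);
        error = null;
        return true;
    }

    // Walks the parent chain with int? as the "no parent" marker, so 0 is an ordinary id.
    public static bool HasInheritanceCycle(IReadOnlyDictionary<int, TemplateNode> byId, int startId)
    {
        var visited = new HashSet<int>();
        int? currentId = startId;
        while (currentId is int id && byId.TryGetValue(id, out var node))
        {
            if (!visited.Add(id)) return true; // revisited a template: cycle
            currentId = node.ParentTemplateId;
        }
        return false;
    }
}
```

With `int?` carrying the absence of a parent, a template whose `Id` is 0 takes part in cycle detection like any other, and duplicate input surfaces as a failure value rather than an unhandled exception.
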
**Resolution**
_Unresolved._
### TemplateEngine-014 — Template-deletion constraint logic is duplicated and divergent
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.TemplateEngine/TemplateService.cs:109`, `src/ScadaLink.TemplateEngine/Services/TemplateDeletionService.cs:27` |
**Description**
`TemplateService.DeleteTemplateAsync` and `TemplateDeletionService.CanDeleteTemplateAsync`
both implement the "can this template be deleted" rules (instances, child templates,
derived templates, composing templates). The two implementations have already drifted:
`TemplateService` reads composing templates from the in-memory `t.Compositions`
navigation while `TemplateDeletionService` issues per-template
`GetCompositionsByTemplateIdAsync` calls (see TemplateEngine-009), they format error
messages differently, and `TemplateService` returns on the first failing category while
`TemplateDeletionService` accumulates all of them. A future rule change must be made in
two places or behaviour will diverge further.
**Recommendation**
Make `TemplateService.DeleteTemplateAsync` delegate to `TemplateDeletionService` (or
vice versa) so the constraint logic lives in exactly one place.
**Resolution**
_Unresolved._

@@ -0,0 +1,67 @@
# Code Review — <Module>
<!--
Template for a module review. Copy the structure below into
code-reviews/<Module>/findings.md and fill it in.
See ../REVIEW-PROCESS.md for the full process.
-->
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.<Module>` |
| Design doc | `docs/requirements/Component-<Name>.md` |
| Status | Not yet reviewed \| In progress \| Reviewed |
| Last reviewed | YYYY-MM-DD |
| Reviewer | <name> |
| Commit reviewed | `<short SHA>` |
| Open findings | 0 |
## Summary
One short paragraph: overall health of the module, themes across findings, and
anything notable that is not a finding.
## Checklist coverage
Confirm every category was examined. Record "No issues found" where applicable.
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☐ | |
| 2 | Akka.NET conventions | ☐ | |
| 3 | Concurrency & thread safety | ☐ | |
| 4 | Error handling & resilience | ☐ | |
| 5 | Security | ☐ | |
| 6 | Performance & resource management | ☐ | |
| 7 | Design-document adherence | ☐ | |
| 8 | Code organization & conventions | ☐ | |
| 9 | Testing coverage | ☐ | |
| 10 | Documentation & comments | ☐ | |
## Findings
<!-- One entry per finding. Copy the block below. Never delete a finding; close it
by changing Status and completing Resolution. -->
### <Module>-001 — <Short title>
| | |
|--|--|
| Severity | Critical \| High \| Medium \| Low |
| Category | <one of the 10 checklist categories> |
| Status | Open \| In Progress \| Resolved \| Won't Fix \| Deferred |
| Location | `src/ScadaLink.<Module>/<File>.cs:<line>` |
**Description**
What is wrong and why it matters.
**Recommendation**
Concrete suggested fix.
**Resolution**
_Unresolved._
<!-- When closed: fixing commit `<SHA>`, date YYYY-MM-DD, one-line description.
For Won't Fix / Deferred, justify the decision here. -->