docs(code-reviews): re-review batch 3 at 39d737e — Host, InboundAPI, ManagementService, NotificationService, Security

21 new findings: Host-012..015, InboundAPI-014..017, ManagementService-014..017, NotificationService-014..018, Security-012..015.
This commit is contained in:
Joseph Doherty
2026-05-17 00:48:25 -04:00
parent 89636e2bbf
commit 3b3760f026
6 changed files with 873 additions and 41 deletions

View File

@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.Host` |
| Design doc | `docs/requirements/Component-Host.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 4 |
## Summary
@@ -31,20 +31,37 @@ and unguarded string interpolation when building HOCON. None are crash/data-loss
class, but the readiness bug is High because it breaks load-balancer behaviour with
no safe workaround.
#### Re-review 2026-05-17 (commit `39d737e`)
All eleven findings from the first review (Host-001..011) are confirmed `Resolved`
in the current tree — the `/health/ready` predicate, the externalised secrets, the
seed-node/GrpcPort validation rules, the escaped HOCON builder, the bounded
migration retry and the live `LoggingOptions.MinimumLevel` are all present as
described in their Resolution notes. This re-review walked all ten checklist
categories again over the full module and recorded four new findings, none of them
crash/data-loss class. Following up on a batch-1 ClusterInfrastructure re-review
note, Host-012 (Medium) confirms `AkkaHostedService.BuildHocon` hard-codes
`down-if-alone = on` and never reads the `ClusterOptions.DownIfAlone` property, so
that documented, bound option is dead. The remaining three are Low: `:F0` rounding
of cluster timing values silently degrades any sub-second configuration (Host-013),
Serilog sink setup is hard-coded in `Program.cs` rather than configuration-driven as
REQ-HOST-8 requires (Host-014), and `StartupRetry` retries indiscriminately on every
exception type including permanent schema-validation failures (Host-015).
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `/health/ready` includes the leader-only check (Host-001); site seed-node config points at the gRPC port (Host-004). |
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. HOCON built by raw string interpolation (Host-006); `StartAsync` returns before actors are confirmed running (Host-009). |
| 3 | Concurrency & thread safety | ☑ | Blocking `GetAwaiter().GetResult()` on a hosted-service startup thread (Host-005). `DeadLetterMonitorActor` state is actor-confined — no issues. |
| 4 | Error handling & resilience | ☑ | Top-level try/catch logs fatal and rethrows. No retry around DB migration / readiness preconditions (Host-010). |
| 5 | Security | ☑ | Plaintext DB password, LDAP service-account password and dev JWT key checked into `appsettings.Central.json` (Host-003). |
| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. |
| 7 | Design-document adherence | ☑ | REQ-HOST-6 mandates Akka.Persistence config but none exists and no persistent actors exist — doc is stale (Host-002). REQ-HOST-4 GrpcPort-≠-RemotingPort rule not enforced (Host-007). |
| 8 | Code organization & conventions | ☑ | `MachineDataDb` validated/declared but never consumed (Host-008). `LoggingOptions.MinimumLevel` is dead (Host-011). |
| 9 | Testing coverage | ☑ | Strong suite; no test asserts `/health/ready` excludes `active-node`, which is why Host-001 slipped through (noted in Host-001). |
| 10 | Documentation & comments | ☑ | Comments are accurate. REQ-HOST-6 in the design doc is the main stale-doc item (Host-002). |
| 1 | Correctness & logic bugs | ☑ | Host-001/004 resolved. Re-review: `:F0` rounding of cluster timing values silently distorts sub-second durations (Host-013). |
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. Host-006/009 resolved; no new issues. |
| 3 | Concurrency & thread safety | ☑ | Host-005 resolved (`StartAsync` now genuinely async). `DeadLetterMonitorActor` state is actor-confined — no issues. |
| 4 | Error handling & resilience | ☑ | Host-010 resolved. Re-review: `StartupRetry` retries indiscriminately on permanent failures (e.g. schema-validation mismatch) (Host-015). |
| 5 | Security | ☑ | Host-003 resolved — secrets externalised to env vars. No new issues. |
| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. No new issues. |
| 7 | Design-document adherence | ☑ | Host-002/007 resolved. Re-review: `ClusterOptions.DownIfAlone` bound/documented but never consumed — HOCON hard-codes `on` (Host-012); Serilog sinks hard-coded, not config-driven per REQ-HOST-8 (Host-014). |
| 8 | Code organization & conventions | ☑ | Host-008/011 resolved. No new issues. |
| 9 | Testing coverage | ☑ | Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new `down-if-alone`, sub-second-duration, or non-transient-retry paths (Host-012/013/015). |
| 10 | Documentation & comments | ☑ | REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014). |
## Findings
@@ -561,3 +578,153 @@ Regression tests in new `LoggerConfigurationTests`:
— they assert the configured level actually filters log events; they pass against the
new factory (and would fail against the old inline configuration, which ignored the
key). Full Host suite green (175 passed).
### Host-012 — `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:146-148` |
**Description**
`AkkaHostedService.BuildHocon` emits the split-brain-resolver block with
`keep-oldest { down-if-alone = on }` as a literal constant. `ClusterOptions`
(`src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:74`) exposes a
`DownIfAlone` property — bound from `ScadaLink:Cluster` via the Options pattern,
documented as "the design-doc requirement", default `true` — but a repo-wide search
shows it is referenced **nowhere outside its own declaration**. The Host therefore
ignores the bound value entirely: setting `ScadaLink:Cluster:DownIfAlone` to `false`
in `appsettings.json` has no effect, the resolver still runs with `down-if-alone =
on`. This is the same class of defect as the resolved Host-011
(`LoggingOptions.MinimumLevel` was dead config) — a configuration option that is
declared, bound, and documented but never consumed, which silently misleads any
operator who edits it. A batch-1 re-review of ClusterInfrastructure flagged the same
hard-coding; it is recorded here because `BuildHocon` is Host-owned code (the
ClusterInfrastructure project owns only the configuration contract, per the
`ClusterOptions` class comment).
**Recommendation**
Make `BuildHocon` consume `clusterOptions.DownIfAlone` — emit `down-if-alone =
{(clusterOptions.DownIfAlone ? "on" : "off")}` (the value is a bool, so no escaping
is needed). Add a `HoconBuilderTests` case asserting both `true` and `false` produce
the corresponding `down-if-alone` token. If the flag is genuinely meant to be fixed
at `on` for ScadaLink's two-node clusters, remove the `DownIfAlone` property and its
doc comment instead so code and configuration contract agree — but do not leave it
declared-but-dead.
**Resolution**
_Unresolved._
### Host-013 — `:F0` rounding of cluster timing values silently degrades sub-second configuration
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:135-136,145,151-152` |
**Description**
`BuildHocon` renders every duration into HOCON via `TotalSeconds:F0`
`transportHeartbeatSec:F0`, `transportFailureSec:F0`, `StableAfter.TotalSeconds:F0`,
`HeartbeatInterval.TotalSeconds:F0`, `FailureDetectionThreshold.TotalSeconds:F0`. The
`F0` format specifier rounds to whole seconds, so any sub-second configuration is
silently distorted: a `HeartbeatInterval` of `00:00:00.5` renders as
`heartbeat-interval = 0s` (a degenerate / invalid value that Akka will reject or
treat as zero), and `00:00:02.6` becomes `3s`. The shipped defaults are all whole
seconds so this is latent, but the configuration model accepts arbitrary `TimeSpan`
values and an operator tuning a heartbeat to e.g. `750ms` would get a `1s` value with
no warning — or, worse, `0s`. Rounding configured timing values without surfacing
the change is a correctness hazard for exactly the kind of failure-detection tuning
these options exist for.
**Recommendation**
Render durations without precision loss — emit milliseconds (e.g.
`heartbeat-interval = {ts.TotalMilliseconds:F0}ms`) so sub-second values survive, or
validate in `StartupValidator` that each cluster timing value is a positive whole
number of seconds and fail fast otherwise. Either way, do not silently round.
**Resolution**
_Unresolved._
### Host-014 — Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8)
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Host/Program.cs:43-48`, `src/ScadaLink.Host/appsettings.json:1-7` |
**Description**
REQ-HOST-8 requires the Host to configure Serilog with "Configuration-driven sink
setup (console and file sinks at minimum)". `LoggerConfigurationFactory.Build` calls
`.ReadFrom.Configuration(configuration)`, which reads the standard `Serilog`
configuration section — but neither `appsettings.json` nor either role-specific file
contains a `Serilog` section (only a Microsoft `Logging` section with a `LogLevel`
map, which Serilog's `ReadFrom.Configuration` does not consume). The two sinks that
actually run are appended in `Program.cs` as hard-coded `.WriteTo.Console(...)` and
`.WriteTo.File("logs/scadalink-.log", rollingInterval: Day)` calls. As a result the
console output template, the file path, and the rolling interval cannot be changed
without recompiling — an operator cannot redirect logs, change the file location, or
add a sink via configuration, contrary to REQ-HOST-8. The `ReadFrom.Configuration`
call is effectively a no-op because the section it reads does not exist.
**Recommendation**
Move the console and file sink definitions into a `Serilog` section in
`appsettings.json` (with `WriteTo` entries and the output template / file path /
rolling interval as arguments) so `ReadFrom.Configuration` drives them, and drop the
hard-coded `.WriteTo` calls from `Program.cs`. Alternatively, update REQ-HOST-8 to
state the sinks are intentionally code-defined — but the design doc currently says
"configuration-driven", so code and doc must be reconciled.
**Resolution**
_Unresolved._
### Host-015 — `StartupRetry` retries on every exception type, including permanent failures
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Host/StartupRetry.cs:36-45` |
**Description**
`StartupRetry.ExecuteWithRetryAsync` catches `Exception` with only the guard
`when (attempt < maxAttempts)` — it retries on *any* exception type. Its sole caller
wraps `MigrationHelper.ApplyOrValidateMigrationsAsync` (`Program.cs:124-134`), which
on a Central node in production *validates* the schema version and throws if the
deployed schema does not match the expected migration set. A schema-version mismatch
is a permanent error — retrying it cannot succeed — yet `StartupRetry` will retry it
8 times with capped exponential backoff (2s, 4s, 8s, 16s, 30s, 30s, 30s ≈ 2 minutes)
before finally rethrowing, delaying the fatal-exit-and-alert by minutes for a fault
that is fatal on the first attempt. The retry helper is meant for *transient*
connection faults (the XML doc says exactly that: "the database may be briefly
unreachable"), but it cannot distinguish transient from permanent failures.
**Recommendation**
Restrict retries to transient faults — e.g. accept an `Func<Exception, bool>
isTransient` predicate and, for the migration call site, treat only
connection-class exceptions (`SqlException` with a connection/transport error
number, `TimeoutException`, socket errors) as retryable while letting
schema-validation / `InvalidOperationException` failures propagate immediately. Add
a `StartupRetryTests` case asserting a non-transient exception is rethrown after a
single attempt.
**Resolution**
_Unresolved._

View File

@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.InboundAPI` |
| Design doc | `docs/requirements/Component-InboundAPI.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 4 |
## Summary
@@ -30,6 +30,25 @@ well but there is no coverage of the HTTP endpoint, concurrency, or recompilatio
None of the findings are data-loss-class, but the concurrency and trust-model issues
are High severity and should be addressed before production use.
#### Re-review 2026-05-17 (commit `39d737e`)
All 13 findings from the initial review remain `Resolved`; the module source under
`src/ScadaLink.InboundAPI` is unchanged since the last InboundAPI fix commit
(`8dd7412`), which precedes `39d737e`. This re-review re-walked all 10 checklist
categories against the resolved code and surfaced **4 new findings** — none touching
the previously-fixed concurrency/trust-model code, but all in areas the first pass
did not probe deeply: (1) the `ReturnDefinition` column is loaded onto `ApiMethod`
but is never consulted — script return values are serialized verbatim with no shaping
or validation against the declared return structure (InboundAPI-014); (2) the new
`ForbiddenApiChecker` is a purely textual syntax walker and can be bypassed by
reaching forbidden functionality through member access that never spells a forbidden
namespace, e.g. `typeof(x).Assembly.GetType("System.IO.File")` (InboundAPI-015);
(3) routed `Route.To().Call()` invocations are not bound by the method timeout unless
the script explicitly threads `Parameters`-side cancellation, contradicting the design
statement that the timeout covers routed calls (InboundAPI-016); and (4) `RouteHelper`
/ `RouteTarget` — the entire WP-4 cross-site routing surface — has no test coverage
(InboundAPI-017). New findings are one Medium-trio plus one Low; no Critical or High.
## Checklist coverage
| # | Category | Examined | Notes |
@@ -38,11 +57,11 @@ are High severity and should be addressed before production use.
| 2 | Akka.NET conventions | ☑ | Module is ASP.NET-hosted, no actors of its own; routes to actors via `CommunicationService`. No correlation-ID issues — IDs are set in `RouteHelper`. |
| 3 | Concurrency & thread safety | ☑ | Singleton `InboundScriptExecutor` mutates a non-thread-safe `Dictionary` from concurrent request threads — see InboundAPI-001/002. |
| 4 | Error handling & resilience | ☑ | Catch-all conflates client cancellation with timeout (InboundAPI-004); compilation-failure path repeats work on every request (InboundAPI-009). |
| 5 | Security | ☑ | Non-constant-time key comparison, no trust-model enforcement, no body-size limit, missing-method enumeration oracle — see InboundAPI-003/005/006/011. |
| 5 | Security | ☑ | Prior items resolved. Re-review: `ForbiddenApiChecker` is a textual deny-list bypassable via reflection without a forbidden namespace token (InboundAPI-015). |
| 6 | Performance & resource management | ☑ | Up to 3 separate DB round-trips per request in `ApiKeyValidator`; uncapped lazy recompilation. |
| 7 | Design-document adherence | ☑ | `Database.Connection()` script API missing; central-only hosting not enforced; lazy-compile diverges from "compiled at startup". |
| 8 | Code organization & conventions | ☑ | `ParameterDefinition` is an API-shaped POCO declared in the component project rather than Commons; otherwise conventions followed. |
| 9 | Testing coverage | ☑ | Good unit coverage of the two validators; no endpoint, concurrency, recompilation, or timeout-vs-cancel tests. |
| 7 | Design-document adherence | ☑ | Re-review: `ReturnDefinition` loaded but never used (InboundAPI-014); routed-call timeout not enforced (InboundAPI-016). Prior `Database.Connection()`/central-only items resolved. |
| 8 | Code organization & conventions | ☑ | `ParameterDefinition` moved to Commons (InboundAPI-012 resolved); no new issues. |
| 9 | Testing coverage | ☑ | Re-review: `RouteHelper`/`RouteTarget` (WP-4 routing) entirely untested (InboundAPI-017); validators/executor/filter well covered. |
| 10 | Documentation & comments | ☑ | `ApiKeyValidationResult.NotFound` XML/name says "NotFound" but returns HTTP 400 — misleading (InboundAPI-013). |
## Findings
@@ -580,3 +599,181 @@ the new method-not-found status, and removing dead code cannot regress. Doc-owne
follow-up: `Component-InboundAPI.md`'s Error Handling section still does not list a
"method not found" status; it should note that it is reported as 403 (indistinguishable
from "key not approved"), but that doc edit is outside this module's editable scope.
### InboundAPI-014 — `ReturnDefinition` is loaded but never used; script return value is unshaped/unvalidated
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:201-205`, `src/ScadaLink.Commons/Entities/InboundApi/ApiMethod.cs:10` |
**Description**
`Component-InboundAPI.md` ("API Method Definition → Return Value Definition" and the
"Response Format" section) specifies that each method has a declared return structure
— "Field names and data types … Supports returning lists of objects" — and that the
success response body is "the method's return value as JSON, with fields matching the
return value definition". The `ApiMethod` entity carries a `ReturnDefinition` column
to hold exactly this. However, nothing in the module ever reads `ReturnDefinition`:
`ExecuteAsync` takes whatever object the script happens to return and does a blind
`JsonSerializer.Serialize(result)`. There is no validation that the script's return
value matches the declared shape, no coercion to the declared field types, and no
error when a method returns a structure inconsistent with its definition. A method
whose script returns the wrong shape (or `null` where a structure is required) will
silently emit a malformed 200 response, and the documented return-definition contract
is effectively unenforced. This is the response-side mirror of the parameter
validation that `ParameterValidator` does perform, leaving the two halves of the
method contract asymmetric.
**Recommendation**
Either (a) implement return-value validation/shaping: parse `ReturnDefinition` with
the same extended-type machinery used for parameters and validate/coerce the script
result before serializing, returning a 500 (or logging) when the script result does
not match; or (b) if return shaping is deliberately out of scope, remove the "Return
Value Definition" / "fields matching the return value definition" language from
`Component-InboundAPI.md` and document that the response is the script's raw return
value serialized as-is. Code and design doc must be reconciled.
**Resolution**
_Unresolved._
### InboundAPI-015 — `ForbiddenApiChecker` is purely textual and is bypassable via reflection reachable without a forbidden namespace token
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/ForbiddenApiChecker.cs:63-119`, `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:109-126` |
**Description**
`ForbiddenApiChecker` walks the script syntax tree and rejects any `using` directive,
`QualifiedNameSyntax`, or `MemberAccessExpressionSyntax` whose textual dotted name
starts with a forbidden namespace prefix (`System.IO`, `System.Diagnostics`,
`System.Reflection`, `System.Net`, etc.). This is a textual match, not a semantic
one, and the trust model it enforces (per InboundAPI-005) is explicitly meant to keep
*untrusted* Design-role scripts away from host APIs. The check can be bypassed because
forbidden functionality is reachable through member access that never spells a
forbidden namespace:
- `typeof(string).Assembly.GetType("System.IO.File")``typeof(string)` is permitted,
`.Assembly` is a `System.Type` property, `.GetType(string)` is a `System.Reflection.Assembly`
method. The string literal `"System.IO.File"` is a string, not a `QualifiedNameSyntax`
or `MemberAccessExpressionSyntax`, so `IsForbidden` never sees it. The script obtains
a `System.IO.File` `Type` and can `InvokeMember`/`GetMethod(...).Invoke(...)` on it —
all via members of permitted types — with no forbidden namespace ever appearing in
the source. `CompileAndRegister` references `typeof(object).Assembly`
(System.Private.CoreLib) in `ScriptOptions`, so every framework type is loadable at
runtime.
- The executor also references the `Microsoft.CSharp.RuntimeBinder` assembly
(`InboundScriptExecutor.cs:116`), enabling the `dynamic` keyword, which further
widens late-bound member access that the static walker cannot see through.
Because the inbound API script runs on the central node with the host process's
privileges and is authored by the (less-trusted-than-Admin) Design role, a static
textual deny-list gives a false sense of containment.
**Recommendation**
Treat the syntax walker as defence-in-depth, not the boundary. Strengthen it where
cheap (flag `Assembly.GetType`, `Type.GetType`, `Activator.CreateInstance`,
`InvokeMember`, and `dynamic` usage), but for real enforcement run compiled scripts
under a genuine boundary — a restricted `AssemblyLoadContext`/AppDomain-equivalent, a
curated reference set that does not expose reflection-to-arbitrary-type, or an
out-of-process sandbox — consistent with however the Site Runtime ultimately enforces
its instance-script trust model. At minimum, document in `Component-InboundAPI.md`
that the current check is best-effort and does not stop a determined script.
**Resolution**
_Unresolved._
### InboundAPI-016 — Routed `Route.To().Call()` invocations are not bound by the method timeout
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/RouteHelper.cs:59-152`, `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:177`, `:199` |
**Description**
`Component-InboundAPI.md` states the per-method timeout "defines the maximum time the
method is allowed to execute (**including any routed calls to sites**)", and the
Routing Behavior section says a routed call "blocks until the site responds or the
**method-level timeout** is reached". The executor builds a linked
`CancellationTokenSource` (`cts`) combining the request-abort token and a dedicated
timeout CTS, and exposes `cts.Token` to the script as `InboundScriptContext.CancellationToken`.
However, every `RouteTarget` method (`Call`, `GetAttribute(s)`, `SetAttribute(s)`)
takes `CancellationToken cancellationToken = default` and the script must *explicitly*
pass the context token for the routed call to honour the timeout. A natural script —
`Route.To("inst").Call("doWork", parameters)` — invokes the routed call with
`CancellationToken.None`. That request flows into `CommunicationService.RouteToCallAsync`
with no cancellation, so the routed call is not bounded by the method timeout at all.
The only timeout guard left is `handler(context).WaitAsync(cts.Token)` in
`ExecuteAsync`: when the method timeout fires, `WaitAsync` returns a cancellation to
the caller, but the underlying script `Task` — and the in-flight `RouteToCallAsync`
awaiting a remote site — keeps running orphaned with no cancellation, holding the
correlation/communication resources until the site eventually responds or its own
transport timeout (if any) fires. The design's guarantee that the method timeout
covers routed calls is therefore not met, and a slow/hung site can leak background
work past the timeout the caller was told bounds the request.
**Recommendation**
Make routed calls inherit the method deadline without relying on script discipline:
have `RouteHelper`/`RouteTarget` carry the executing method's `CancellationToken`
(injected by `InboundScriptExecutor` when it constructs the context, e.g. a
`RouteHelper` bound to `cts.Token`) and pass it into every `CommunicationService`
call by default, so `Route.To("x").Call("s", p)` is timeout-bounded with no token
argument. Keep the explicit-token overload for callers that want a tighter bound.
Verify `RouteToCallAsync` and the attribute-routing calls actually observe the token
and abandon the in-flight request when it fires.
**Resolution**
_Unresolved._
### InboundAPI-017 — `RouteHelper` / `RouteTarget` has no test coverage
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `src/ScadaLink.InboundAPI/RouteHelper.cs:1-165`, `tests/ScadaLink.InboundAPI.Tests/` |
**Description**
`RouteHelper`/`RouteTarget` is the entire WP-4 cross-site routing surface — the
`Route.To().Call()/GetAttribute(s)/SetAttribute(s)` API that inbound API scripts use
to reach instances at any site. It has zero tests: the `ScadaLink.InboundAPI.Tests`
project covers `ApiKeyValidator`, `ParameterValidator`, `InboundScriptExecutor`, and
`InboundApiEndpointFilter`, but no test file exercises `RouteHelper`. Untested
behaviours include site resolution via `IInstanceLocator` (including the
"instance not found / no assigned site" `InvalidOperationException` path at
`RouteHelper.cs:154-164`), the `!response.Success``InvalidOperationException`
translation in each routed method, `GetAttribute` delegating to the batch
`GetAttributes` and returning `null` for an absent key, correlation-ID generation,
and `SetAttribute` delegating to `SetAttributes`. These are non-trivial branches
whose failure modes (a thrown exception inside a script) surface to the caller as a
500, so regressions would be silent.
**Recommendation**
Add a `RouteHelperTests` suite using substituted `IInstanceLocator` and
`CommunicationService` (the executor tests already substitute `CommunicationService`):
cover the happy path of each routed method, the unresolved-instance throw, the
`!Success``InvalidOperationException` mapping, and `GetAttribute` returning `null`
for a missing key. This also gives InboundAPI-016 a regression home if the timeout
wiring is added.
**Resolution**
_Unresolved._

View File

@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.ManagementService` |
| Design doc | `docs/requirements/Component-ManagementService.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 (1 Deferred — see ManagementService-012) |
| Commit reviewed | `39d737e` |
| Open findings | 4 (1 Deferred — see ManagementService-012) |
## Summary
@@ -28,6 +28,24 @@ instances never disposed) and dead/unused configuration. None of the findings ar
crash-class, but the site-scope gaps are High severity because they are a real
authorization bypass with no workaround.
#### Re-review 2026-05-17 (commit `39d737e`)
All thirteen prior findings remain correctly closed; the source under
`src/ScadaLink.ManagementService` is byte-identical between the previously reviewed state
and `39d737e` (the resolution commits of findings 001013 are folded into the history at or
before `39d737e`). ManagementService-012 was re-checked and its **Deferred** status still
holds: `ManagementEnvelope.Command` is still typed `object`, and the marker-interface fix
still belongs in the Commons module, outside this module's edit scope — nothing has changed
to make it actionable here. This re-review re-ran the full 10-category checklist against the
current sources and surfaced **four new findings**. The dominant theme is the same
site-scope authorization gap that findings 001/002 closed: `HandleQueryDeployments`
(`QueryDeploymentsCommand`) was overlooked by that sweep and still performs no site-scope
enforcement, letting a site-scoped Deployment user read deployment history for any site
(014, High). The remaining three are lower severity: a non-atomic multi-override mutation
that can leave an instance partially modified after an error (015, Medium), raw exception
messages from unexpected faults being returned verbatim to HTTP callers (016, Low), and
`QueryDeploymentsCommand` having no test coverage at all (017, Low).
## Checklist coverage
| # | Category | Examined | Notes |
@@ -551,3 +569,144 @@ infrastructure, so the testable command-parsing/dispatch logic was extracted int
`ParseCommand` helper and covered instead. Tests: `ParseCommand_*` (5),
`SerializeResult_*` (2), `UnknownCommandType_FaultMappedToManagementError`, plus the
pre-existing site-scope and DebugStreamHub suites. `dotnet test` -> 48 passed.
### ManagementService-014 — HandleQueryDeployments bypasses site-scope enforcement
| | |
|--|--|
| Severity | High |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:306`, `:1174` |
**Description**
`QueryDeploymentsCommand` is gated to the `Deployment` role by `GetRequiredRole`
(`ManagementActor.cs:170``:177`), and the design document's Authorization section states
"Site scoping is enforced for site-scoped Deployment users" and explicitly lists deployments
among the Deployment-role operations. `HandleQueryDeployments` makes no call to
`EnforceSiteScope` / `EnforceSiteScopeForInstance` / `EnforceSiteScopeForIdentifier`: with no
`InstanceId` it returns `IDeploymentManagerRepository.GetAllDeploymentRecordsAsync()` (every
deployment record across all sites), and with an `InstanceId` it returns that instance's
deployment history with no check that the instance's site is within the caller's permitted
set. A site-scoped Deployment user scoped to site A can therefore enumerate deployment
records for instances at site B — instance IDs, `DeployedBy` (operator usernames),
timestamps, deployment status, and `ErrorMessage` content — by issuing `QueryDeployments`
with or without an out-of-scope `InstanceId`. This is the same authorization-bypass class as
the resolved findings 001/002, on a handler that sweep did not cover; it is `DispatchCommand`'s
only `Deployment`-role handler with no scope enforcement.
**Recommendation**
Thread `AuthenticatedUser` into `HandleQueryDeployments` (the dispatch case at line 306
already has `user` in scope). When `cmd.InstanceId` is supplied, call
`EnforceSiteScopeForInstance` before querying. When it is not supplied, filter the returned
`DeploymentRecord` list to the caller's permitted sites — resolve each record's instance to
its `SiteId` (or join through a site-aware repository query) and drop records for sites
outside `PermittedSiteIds`, mirroring the `HandleListInstances` / `HandleListSites` filter
pattern. Add a regression test for a site-scoped user against in-scope and out-of-scope
instances.
**Resolution**
_Unresolved._
### ManagementService-015 — HandleSetInstanceOverrides applies overrides non-atomically
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:647``:659` |
**Description**
`HandleSetInstanceOverrides` iterates `cmd.Overrides` and calls
`InstanceService.SetAttributeOverrideAsync` once per attribute, throwing
`InvalidOperationException` on the first `result.IsSuccess == false`. Each
`SetAttributeOverrideAsync` call persists independently, so if the command supplies five
overrides and the third fails (e.g. an unknown attribute name or a validation error), the
first two overrides are already committed to the configuration database while the caller
receives a `ManagementError`. The instance is left partially mutated in a state the operator
neither sees nor requested, and the per-instance operation lock referenced in the design's
deployment decisions does not protect against this because the partial writes are committed
before the throw. A retry of the same command then re-applies the already-applied overrides.
**Recommendation**
Make the multi-override mutation all-or-nothing: either validate every requested override up
front before applying any, or apply all overrides within a single transaction / unit-of-work
so a mid-batch failure rolls back the earlier writes. If `InstanceService` cannot offer a
batch method, at minimum document the partial-application behaviour on `SetInstanceOverridesCommand`
and have the handler report which overrides were applied before the failure so the caller
can reconcile.
**Resolution**
_Unresolved._
### ManagementService-016 — Unexpected exception messages returned verbatim to HTTP callers
| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:121``:131` |
**Description**
`MapFault` maps any non-`SiteScopeViolationException` fault to
`new ManagementError(correlationId, cause.Message, "COMMAND_FAILED")`, and
`ManagementEndpoints.HandleRequest` returns that `Error` string directly in the HTTP 400
body. For handler-thrown `InvalidOperationException`s carrying a curated `result.Error`
message this is intended and safe. But the same path also surfaces the raw `.Message` of
unanticipated exceptions — a `SqlException` (which can include server/database names and
constraint details), a `DbUpdateException`, an `ArgumentException` from `Enum.Parse` on a
malformed `DataType`/`TriggerType` value, or a `NullReferenceException` — straight to the
external CLI/HTTP client. This is a minor internal-detail disclosure surface: the exception
text is already logged server-side with full context, so the client does not need the raw
message.
**Recommendation**
Distinguish handler-curated failures from unexpected faults. Have handlers throw a dedicated
exception type (e.g. `ManagementCommandException`) for messages that are safe to surface, and
in `MapFault` return that message for the known type while returning a generic
"An internal error occurred (CorrelationId=...)" string for everything else — the operator
can still correlate to the server log via the correlation ID.
**Resolution**
_Unresolved._
### ManagementService-017 — QueryDeploymentsCommand has no test coverage
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.ManagementService.Tests/ManagementActorTests.cs:1` |
**Description**
`QueryDeploymentsCommand` / `HandleQueryDeployments` is exercised by no test in
`ManagementActorTests`. There is no test that it requires the `Deployment` role, no test of
the `InstanceId`-filtered versus unfiltered branches, and — because the handler performs no
site-scope enforcement at all — no test that would have caught finding 014. The deployment
query is one of the operations the design's Authorization section calls out for site
scoping, yet it is the only `Deployment`-role command with neither an authorization test nor
a site-scope test.
**Recommendation**
Add tests for `QueryDeploymentsCommand`: a role test (Design/no-role caller →
`ManagementUnauthorized`), branch coverage for the `InstanceId`-filtered and unfiltered
repository calls, and — once finding 014 is fixed — site-scope tests for a site-scoped
Deployment user against in-scope and out-of-scope deployment records.
**Resolution**
_Unresolved._

View File

@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.NotificationService` |
| Design doc | `docs/requirements/Component-NotificationService.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 5 |
## Summary
@@ -30,6 +30,31 @@ plaintext `string` values. Test coverage exercises the happy path and the main
error branches but misses the OAuth2 delivery path, the permanent-classification
fallback in `DeliverAsync`, and concurrency on the token cache.
#### Re-review 2026-05-17 (commit `39d737e`)
Re-reviewed at commit `39d737e` after the NS-001..013 fixes. All twelve prior
findings remain closed (eleven Resolved, NS-013 Deferred) and the fixes hold up
against the current source — error classification is now typed, the SMTP client is
created once and always disconnected/disposed, the S&F `Notification` delivery
handler is registered in `AkkaHostedService`, and the OAuth2 token cache is keyed
per credential. The re-review surfaced **five new findings**. The recurring theme is
**unclassified-exception handling**: the typed `ClassifySmtpError` helper introduced
by NS-002/003 returns `Unknown` for anything that is not a recognised SMTP/socket
failure, but neither `SendAsync` nor `DeliverBufferedAsync` has a catch-all for the
`Unknown` bucket. As a result an OAuth2 token-fetch failure (`HttpRequestException`
from `EnsureSuccessStatusCode`, or `InvalidOperationException` from a malformed
credential string) escapes `SendAsync` as a raw exception to the calling script
(NS-015) and escapes `DeliverBufferedAsync` out of the S&F handler, where the engine
treats any thrown exception as transient and retries a permanently-broken config
forever (NS-014) — the same defect class NS-008 fixed for address parsing, but the
OAuth2 path was never covered. Secondary findings: `AuthenticateAsync` silently
proceeds unauthenticated for an unknown `authType` or empty credentials (NS-016);
`NotificationOptions` is bound from configuration in two places but never read by
any code (NS-017, dead config — NS-007 sourced the timeout/limit from
`SmtpConfiguration` instead); and the lazily-created concurrency limiter is read
outside its lock, is sized once and never resized on redeployment, and is never
disposed (NS-018).
## Checklist coverage
| # | Category | Examined | Notes |
@@ -465,3 +490,108 @@ filed) and the genuinely-missing tests were added:
`AkkaHostedService` handler registration), so no further test is required here.
Module test suite is green at 56 tests.
### NotificationService-014 — OAuth2 token-fetch failure escapes `DeliverBufferedAsync`; a permanently-broken config is retried forever
| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:214-228`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:308-312`, `src/ScadaLink.NotificationService/OAuth2TokenService.cs:56-84` |
**Description**
`DeliverBufferedAsync` is the registered Store-and-Forward delivery handler for `StoreAndForwardCategory.Notification`. Per the S&F handler contract (`StoreAndForwardService.cs:40-53`), a handler must return `true` for success, `false` for permanent failure (park, no further retries), and **throw only for a transient failure** (retry). `DeliverBufferedAsync`'s `try` block (`:214-228`) catches only `SmtpPermanentException`; the comment at `:227` asserts "Transient SMTP errors propagate out of `DeliverAsync`". But `DeliverAsync` can also throw exceptions that `ClassifySmtpError` buckets as `Unknown` — most importantly the OAuth2 path at `:308-312`, where `OAuth2TokenService.GetTokenAsync` throws `HttpRequestException` (from `EnsureSuccessStatusCode` at `OAuth2TokenService.cs:78`) or `InvalidOperationException` (malformed credential triple, or missing `access_token`). None of these is an `SmtpCommandException`/`SmtpProtocolException`/`SocketException`/`TimeoutException`, so they are neither caught nor classified — they propagate straight out of `DeliverBufferedAsync`. The S&F engine treats *any* thrown exception as a transient failure, so a notification whose SMTP config has an invalid client secret (a 401 → `HttpRequestException`, a genuinely **permanent** condition) is retried on every sweep until `MaxRetries` is reached — burning OAuth2 token-endpoint calls and never parking promptly. A malformed credential string (`InvalidOperationException`, never fixable) is likewise retried instead of parked immediately.
**Recommendation**
Add a catch-all to `DeliverBufferedAsync` for exceptions that `ClassifySmtpError` classifies as `Unknown`: log and return `false` (park) for non-retryable causes such as `InvalidOperationException` from credential parsing, and decide deliberately whether an OAuth2 `HttpRequestException` should park or retry (a 5xx token endpoint is transient; a 401 is permanent — inspect `HttpRequestException.StatusCode`). Do not let an unclassified exception leave the handler and be silently reinterpreted as transient.
**Resolution**
_Unresolved._
### NotificationService-015 — Unclassified exceptions (OAuth2 token fetch, non-cancellation OCE) escape `SendAsync` to the calling script
| | |
|--|--|
| Severity | High |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:96-148`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:308-312` |
**Description**
`SendAsync`'s `try`/`catch` around `DeliverAsync` has exactly three catch clauses: `SmtpPermanentException` (`:101`), `OperationCanceledException` when cancellation was requested (`:111`), and `Exception` filtered by `IsTransientSmtpError` (`:116`). There is no catch-all for an exception that is none of those. `ClassifySmtpError` returns `SmtpErrorClass.Unknown` for anything it does not recognise, and `IsTransientSmtpError` only matches `Transient`, so an `Unknown`-classified exception falls through all three clauses and escapes `SendAsync` unhandled. Concretely, the OAuth2 branch at `:308-312` calls `OAuth2TokenService.GetTokenAsync`, which throws `HttpRequestException` (token endpoint unreachable / 4xx / 5xx) or `InvalidOperationException` (malformed `tenant:client:secret` triple, or no `access_token` in the response). Both reach the calling script (an instance/alarm/shared script) as a raw exception of a type the `INotificationDeliveryService` contract never advertises — the design doc says delivery returns either success, a buffered transient result, or a permanent `NotificationResult` error. `Notify.To().Send()` against an OAuth2 config with a bad secret therefore crashes the script instead of returning a clean `NotificationResult(false, ...)`. This is the same defect class NS-008 resolved for malformed addresses, but NS-008's `ValidateAddresses` fix never covered the OAuth2 token path. (A non-cancellation `OperationCanceledException` — e.g. an internal MailKit timeout surfaced as OCE while the caller's token is *not* cancelled — would escape the same way.)
**Recommendation**
Add a final `catch (Exception ex)` to `SendAsync` that converts any otherwise-unhandled exception into a permanent `NotificationResult(false, ...)` (credential-scrubbed, consistent with NS-009), and log it. Treating an unclassified failure as permanent is the safe default — it returns control to the script rather than crashing it; an OAuth2 transient (token endpoint 5xx) being reported as permanent is an acceptable trade against an unhandled crash, and can be refined by inspecting `HttpRequestException.StatusCode` if transient buffering of token failures is wanted.
**Resolution**
_Unresolved._
### NotificationService-016 — `AuthenticateAsync` silently sends unauthenticated for an unknown auth type or empty credentials
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:46-67` |
**Description**
`MailKitSmtpClientWrapper.AuthenticateAsync` returns without authenticating in two silent paths. (1) `if (string.IsNullOrEmpty(credentials)) return;` (`:48-49`) — if an SMTP config has `AuthType` "basic" or "oauth2" but a null/empty `Credentials` value (a misconfigured or partially-deployed row), the method returns and `DeliverAsync` proceeds straight to `SendAsync` on an *unauthenticated* connection. (2) The `switch (authType.ToLowerInvariant())` (`:51-66`) has cases only for `"basic"` and `"oauth2"` and **no `default`** — any other `AuthType` value (a typo, a future "ntlm", an empty string) falls through with no authentication attempted. Additionally, in the `"basic"` case, `credentials.Split(':', 2)` that yields fewer than 2 parts (a credential string with no colon) is silently skipped (`:55-58`). In every case the connection then attempts to send mail unauthenticated: against a relay that does not require auth this masks a misconfiguration; against a server that does require auth it fails later with a less obvious error, and at worst an unauthenticated send succeeds where it should not. Authentication being skipped should never be silent.
**Recommendation**
Make missing/unparseable credentials and an unrecognised `AuthType` hard errors: throw a typed exception (e.g. a permanent configuration exception so `SendAsync`/`DeliverBufferedAsync` surface it as a clean failure / park) rather than returning. Add a `default:` arm to the `switch` that throws `ArgumentException` naming the unsupported auth type, and reject a "basic" credential string that does not split into exactly two parts.
**Resolution**
_Unresolved._
### NotificationService-017 — `NotificationOptions` is bound from configuration but never read (dead config)
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationOptions.cs:1-15`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:10-11`, `src/ScadaLink.Host/SiteServiceRegistration.cs:70` |
**Description**
`NotificationOptions` (with `ConnectionTimeoutSeconds` and `MaxConcurrentConnections`) is bound from the `ScadaLink:Notification` configuration section in two places — `ServiceCollectionExtensions.AddNotificationService` (`AddOptions<NotificationOptions>().BindConfiguration(...)`) and again in `Host/SiteServiceRegistration.cs:70` (`services.Configure<NotificationOptions>(...)`). However, a repo-wide search shows no code ever injects `IOptions<NotificationOptions>` or otherwise reads either property. When NS-007 enforced the connection timeout and concurrency limit, it sourced both values from the per-site `SmtpConfiguration` entity, not from `NotificationOptions` — so this options class is now entirely dead configuration. Its XML doc still claims it "provides fallback defaults and operational limits", which is misleading: nothing falls back to it. The double binding is also redundant. Dead, falsely-documented configuration invites an operator to set `ScadaLink:Notification:ConnectionTimeoutSeconds` and expect it to take effect, when it has no effect at all.
**Recommendation**
Either delete `NotificationOptions` and both of its registrations, or genuinely wire it in as the documented fallback (e.g. `DeliverAsync` uses `NotificationOptions` values when the `SmtpConfiguration` field is non-positive). If kept, remove the duplicate `Configure` call and correct the XML doc. The NS-007 resolution note already states the `NotificationOptions` fields "remain as operational fallback defaults" — that intent should be implemented or the class removed.
**Resolution**
_Unresolved._
### NotificationService-018 — Concurrency limiter: lock-free read of a non-volatile field, never resized on redeployment, never disposed
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:237-255` |
**Description**
Three minor issues with the NS-007 concurrency limiter. (1) `GetConcurrencyLimiter` reads the plain (non-`volatile`) field `_concurrencyLimiter` at `:242` *outside* `_limiterLock` as a fast path. The field is only written inside the lock (`:252`), so under the .NET memory model a concurrent thread is not guaranteed to observe the fully-constructed `SemaphoreSlim` via the lock-free read. In practice on x86/x64 this is benign and the double-checked write inside the lock is correct, but the unsynchronized read of a reference set under a lock is the classic lazy-init memory-model pitfall — use `Volatile.Read`/`LazyInitializer`/`Lazy<T>`. (2) The limiter is sized once from the *first* `SmtpConfiguration` ever seen and the `??=` at `:252` means it is never re-created. The design doc says one SMTP config is deployed per site, but that config can be **redeployed** with a different `MaxConcurrentConnections`; the new value is silently ignored for the lifetime of the (scoped, but potentially long-lived per request) service — and more importantly the limit is captured per `NotificationDeliveryService` instance, so behaviour depends on which scoped instance happens to create it first. (3) `SemaphoreSlim` implements `IDisposable`; `_concurrencyLimiter` is never disposed and `NotificationDeliveryService` does not implement `IDisposable`. A scoped service leaking one undisposed `SemaphoreSlim` per scope is a slow handle leak.
**Recommendation**
Replace the hand-rolled double-checked init with `Lazy<SemaphoreSlim>` or `LazyInitializer.EnsureInitialized`, or at minimum make the field access `Volatile.Read`/`Volatile.Write`. Consider hoisting the limiter to a singleton keyed by site (the limit is a per-site invariant, not a per-scope one) so redeployment with a changed limit and disposal are handled in one place; if it stays on the scoped service, implement `IDisposable` to dispose it.
**Resolution**
_Unresolved._

View File

@@ -40,10 +40,10 @@ module file and counted in **Total**.
| Severity | Open findings |
|----------|---------------|
| Critical | 0 |
| High | 5 |
| Medium | 12 |
| Low | 17 |
| **Total** | **34** |
| High | 8 |
| Medium | 20 |
| Low | 27 |
| **Total** | **55** |
## Module Status
@@ -59,11 +59,11 @@ module file and counted in **Total**.
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 |
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 |
| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/3 | 4 | 16 |
| [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
| [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
| [NotificationService](NotificationService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
| [Security](Security/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
| [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/3 | 4 | 15 |
| [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/3/1 | 4 | 17 |
| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/2 | 4 | 17 |
| [NotificationService](NotificationService/findings.md) | 2026-05-16 | `9c60592` | 0/2/1/2 | 5 | 18 |
| [Security](Security/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/2 | 4 | 15 |
| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 |
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
@@ -80,7 +80,7 @@ description, location, recommendation — lives in the module's `findings.md`.
_None open._
### High (5)
### High (8)
| ID | Module | Title |
|----|--------|-------|
@@ -89,8 +89,11 @@ _None open._
| DataConnectionLayer-014 | [DataConnectionLayer](DataConnectionLayer/findings.md) | DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger |
| DeploymentManager-015 | [DeploymentManager](DeploymentManager/findings.md) | Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates |
| ExternalSystemGateway-015 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `MaxRetries == 0` is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim |
| ManagementService-014 | [ManagementService](ManagementService/findings.md) | HandleQueryDeployments bypasses site-scope enforcement |
| NotificationService-014 | [NotificationService](NotificationService/findings.md) | OAuth2 token-fetch failure escapes `DeliverBufferedAsync`; a permanently-broken config is retried forever |
| NotificationService-015 | [NotificationService](NotificationService/findings.md) | Unclassified exceptions (OAuth2 token fetch, non-cancellation OCE) escape `SendAsync` to the calling script |
### Medium (12)
### Medium (20)
| ID | Module | Title |
|----|--------|-------|
@@ -106,8 +109,16 @@ _None open._
| DeploymentManager-016 | [DeploymentManager](DeploymentManager/findings.md) | Reconciled prior record keeps its stale `RevisionHash` |
| ExternalSystemGateway-016 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `ConfigureHttpClientDefaults` applies the ESG connection cap to every `HttpClient` in the host process |
| HealthMonitoring-015 | [HealthMonitoring](HealthMonitoring/findings.md) | Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt` |
| Host-012 | [Host](Host/findings.md) | `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read |
| InboundAPI-014 | [InboundAPI](InboundAPI/findings.md) | `ReturnDefinition` is loaded but never used; script return value is unshaped/unvalidated |
| InboundAPI-015 | [InboundAPI](InboundAPI/findings.md) | `ForbiddenApiChecker` is purely textual and is bypassable via reflection reachable without a forbidden namespace token |
| InboundAPI-016 | [InboundAPI](InboundAPI/findings.md) | Routed `Route.To().Call()` invocations are not bound by the method timeout |
| ManagementService-015 | [ManagementService](ManagementService/findings.md) | HandleSetInstanceOverrides applies overrides non-atomically |
| NotificationService-016 | [NotificationService](NotificationService/findings.md) | `AuthenticateAsync` silently sends unauthenticated for an unknown auth type or empty credentials |
| Security-012 | [Security](Security/findings.md) | Partial LDAP failure during login yields a roleless authenticated session |
| Security-014 | [Security](Security/findings.md) | `RefreshToken` re-issues a token without checking the idle timeout |
### Low (17)
### Low (27)
| ID | Module | Title |
|----|--------|-------|
@@ -128,3 +139,13 @@ _None open._
| HealthMonitoring-013 | [HealthMonitoring](HealthMonitoring/findings.md) | Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout` |
| HealthMonitoring-014 | [HealthMonitoring](HealthMonitoring/findings.md) | `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service |
| HealthMonitoring-016 | [HealthMonitoring](HealthMonitoring/findings.md) | `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider` |
| Host-013 | [Host](Host/findings.md) | `:F0` rounding of cluster timing values silently degrades sub-second configuration |
| Host-014 | [Host](Host/findings.md) | Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8) |
| Host-015 | [Host](Host/findings.md) | `StartupRetry` retries on every exception type, including permanent failures |
| InboundAPI-017 | [InboundAPI](InboundAPI/findings.md) | `RouteHelper` / `RouteTarget` has no test coverage |
| ManagementService-016 | [ManagementService](ManagementService/findings.md) | Unexpected exception messages returned verbatim to HTTP callers |
| ManagementService-017 | [ManagementService](ManagementService/findings.md) | QueryDeploymentsCommand has no test coverage |
| NotificationService-017 | [NotificationService](NotificationService/findings.md) | `NotificationOptions` is bound from configuration but never read (dead config) |
| NotificationService-018 | [NotificationService](NotificationService/findings.md) | Concurrency limiter: lock-free read of a non-volatile field, never resized on redeployment, never disposed |
| Security-013 | [Security](Security/findings.md) | `ExtractFirstRdnValue` mis-parses group DNs containing escaped commas |
| Security-015 | [Security](Security/findings.md) | Username is not trimmed before use in the LDAP filter, fallback DN, and JWT claims |

View File

@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.Security` |
| Design doc | `docs/requirements/Component-Security.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 (1 deferred — Security-008) |
| Commit reviewed | `39d737e` |
| Open findings | 4 (1 deferred — Security-008) |
## Summary
@@ -28,6 +28,26 @@ on every refresh, weakening the documented 30-minute idle policy. None of these
crash/data-loss bugs, but the TLS, cookie, and key-validation items are security
defects that should be fixed before any production deployment.
#### Re-review 2026-05-17 (commit `39d737e`)
Re-reviewed the module after the batch-resolution of Security-001..007, 009, 010, 011.
Those fixes were verified in place and remain Resolved: the `LdapTransport` enum makes
StartTLS reachable, the auth cookie is `SecurePolicy.Always` with a sliding window, the
JWT signing key is length-validated at construction, issuer/audience are bound and
checked, the DN-injection fallback is RFC 4514-escaped, and the idle-timeout anchor is
carried across refreshes. Security-008 (N+1 scope-rule query) remains correctly
**Deferred**`ISecurityRepository` still exposes no per-set scope-rule query, so the
N+1 cannot be removed from within `RoleMapper.cs` alone; the deferral condition is
unchanged. This pass surfaced **4 new findings** (Security-012..015): two Medium and
two Low. The most notable is that a *partial* LDAP outage during login (bind succeeds,
group search fails) silently returns an authenticated session with **zero roles**
instead of failing the login as the design's LDAP-failure rule requires
(Security-012); and that `RefreshToken` re-issues a token without ever checking
`IsIdleTimedOut`, so an idle-expired session can be renewed indefinitely if the caller
omits the separate idle check (Security-014). The two Low findings concern fragile
DN parsing of group names containing escaped commas and an un-trimmed username flowing
into the LDAP filter, fallback DN, and JWT claims.
## Checklist coverage
| # | Category | Examined | Notes |
@@ -456,3 +476,141 @@ suite. Remaining gaps closed here: renamed the test file `UnitTest1.cs` →
LDAP-timeout coverage (Security-009) plus extra no-service-account / DN-path edge cases
(`BuildFallbackUserDn_NoSearchBase_ReturnsBareRdn`, `EscapeLdapDn_LeadingHash_IsEscaped`,
`EscapeLdapDn_NullOrEmpty_ReturnedUnchanged`). Full module suite: 54 tests, all green.
### Security-012 — Partial LDAP failure during login yields a roleless authenticated session
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Security/LdapAuthService.cs:78-118` |
**Description**
In `AuthenticateAsync`, after the user bind succeeds, the group/attribute `Search` is
wrapped in a `try`/`catch (LdapException)` that logs a warning and continues — the
comment states "Auth succeeded even if attribute lookup failed". The method then returns
`new LdapAuthResult(true, displayName, username, groups, null)` with an **empty
`groups`** list. Because `RoleMapper.MapGroupsToRolesAsync` derives all roles from that
group list, a login during a *partial* LDAP outage (bind reachable, search/group server
unavailable, or a transient `LdapException` mid-search) produces a fully authenticated
session with **zero roles** — the user is logged in but silently stripped of all
permissions. The design's LDAP Connection Failure rule states new logins must **fail**
when LDAP is unavailable; this path instead admits the user. The caller cannot
distinguish "user genuinely belongs to no mapped groups" from "the group query failed",
so it cannot make the documented fail-the-login decision. The `while` loop's inner
`catch (LdapException) { break; }` (line 107-111) compounds this: a real error mid-enumeration
is indistinguishable from end-of-results and silently truncates the group list.
**Recommendation**
Treat a failed group/attribute lookup on the *initial login* as an authentication
failure: return `Success = false` with a "directory unavailable, try again" message, or
add an explicit flag on `LdapAuthResult` (e.g. `GroupLookupSucceeded`) so the caller can
apply the design's fail-the-login rule. The inner `while`-loop `catch` should not treat
an arbitrary `LdapException` as a benign end-of-results sentinel — only the specific
"no more results" condition should break the loop; any other exception should propagate
or be surfaced.
**Resolution**
_Unresolved._
### Security-013 — `ExtractFirstRdnValue` mis-parses group DNs containing escaped commas
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Security/LdapAuthService.cs:258-269` |
**Description**
`ExtractFirstRdnValue` extracts a group name from a `memberOf` DN by taking the substring
between the first `=` and the first `,`. RFC 4514 permits a `,` inside an RDN value when
it is backslash-escaped — e.g. `cn=Acme\, Inc Operators,ou=groups,dc=example,dc=com`. The
method splits on the *escaped* comma and returns `Acme\` as the group name. `RoleMapper`
then matches that mangled name verbatim against configured `LdapGroupMapping.LdapGroupName`
values, so a group whose CN legitimately contains a comma silently fails to map to its
role — the user loses that role with no error. The same naive parsing also does not strip
the trailing backslash escape.
**Recommendation**
Parse the first RDN with an escape-aware scan: ignore a `,` preceded by an unescaped
`\`, and unescape RFC 4514 sequences in the extracted value. Alternatively use the LDAP
library's DN-parsing API (`LdapDN` / RDN accessors) rather than hand-rolled `IndexOf`.
**Resolution**
_Unresolved._
### Security-014 — `RefreshToken` re-issues a token without checking the idle timeout
| | |
|--|--|
| Severity | Medium |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.Security/JwtTokenService.cs:156-169` |
**Description**
`RefreshToken` carries the existing `LastActivity` claim forward (correct, per the
Security-007 fix) and unconditionally issues a fresh token with a new 15-minute expiry.
It never calls `IsIdleTimedOut`. The documented 30-minute idle policy therefore depends
entirely on the CentralUI request pipeline calling `IsIdleTimedOut` *before* calling
`RefreshToken` — two independent, order-dependent checks with no enforcement that they
are used together. If a caller refreshes without first checking idle state (an easy
mistake, and there is no compiler or API signal preventing it), an idle-expired session
is silently renewed: each refresh resets the JWT expiry while `LastActivity` keeps
ageing, so the session can be kept alive indefinitely by background refreshes even
though the user has been idle for hours. `RefreshToken` is the natural place to enforce
the invariant but provides no safety net.
**Recommendation**
Have `RefreshToken` itself reject a principal that is already past the idle timeout —
return `null` (the same "cannot refresh" signal it already uses for missing claims) when
`IsIdleTimedOut(currentPrincipal)` is true — so the idle policy holds regardless of
caller discipline. Document that an idle-expired principal cannot be refreshed and add a
regression test covering refresh of a 31-minute-idle token.
**Resolution**
_Unresolved._
### Security-015 — Username is not trimmed before use in the LDAP filter, fallback DN, and JWT claims
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Security/LdapAuthService.cs:20-21`, `:80`, `:122`, `:169`, `:193` |
**Description**
`AuthenticateAsync` rejects null/whitespace-only usernames via `IsNullOrWhiteSpace`, but
an otherwise-valid username with leading or trailing whitespace (`" alice"`, copy-paste
artefacts, mobile keyboards) is passed through verbatim. It flows into the LDAP search
filter `({attr}={EscapeLdapFilter(username)})`, into the fallback bind DN via
`BuildFallbackUserDn`, and — most consequentially — into the returned `LdapAuthResult`
and from there into the JWT `Username` claim. Most LDAP directories ignore surrounding
whitespace on a search term, so `" alice"` and `"alice"` may both authenticate but yield
JWT `Username` claims that differ by whitespace. Any downstream identity comparison
(audit-log `who`, site-scope lookups keyed by username, session de-duplication) then
sees two distinct identities for the same person.
**Recommendation**
Trim the username once at the top of `AuthenticateAsync` (after the `IsNullOrWhiteSpace`
guard) and use the trimmed value consistently for the filter, the DN, and the
`LdapAuthResult`. This normalises the identity that ends up in the JWT claim and audit
trail.
**Resolution**
_Unresolved._