diff --git a/code-reviews/Host/findings.md b/code-reviews/Host/findings.md index 9303782..73e3f65 100644 --- a/code-reviews/Host/findings.md +++ b/code-reviews/Host/findings.md @@ -5,10 +5,10 @@ | Module | `src/ScadaLink.Host` | | Design doc | `docs/requirements/Component-Host.md` | | Status | Reviewed | -| Last reviewed | 2026-05-16 | +| Last reviewed | 2026-05-17 | | Reviewer | claude-agent | -| Commit reviewed | `9c60592` | -| Open findings | 0 | +| Commit reviewed | `39d737e` | +| Open findings | 4 | ## Summary @@ -31,20 +31,37 @@ and unguarded string interpolation when building HOCON. None are crash/data-loss class, but the readiness bug is High because it breaks load-balancer behaviour with no safe workaround. +#### Re-review 2026-05-17 (commit `39d737e`) + +All eleven findings from the first review (Host-001..011) are confirmed `Resolved` +in the current tree — the `/health/ready` predicate, the externalised secrets, the +seed-node/GrpcPort validation rules, the escaped HOCON builder, the bounded +migration retry and the live `LoggingOptions.MinimumLevel` are all present as +described in their Resolution notes. This re-review walked all ten checklist +categories again over the full module and recorded four new findings, none of them +crash/data-loss class. Following up on a batch-1 ClusterInfrastructure re-review +note, Host-012 (Medium) confirms `AkkaHostedService.BuildHocon` hard-codes +`down-if-alone = on` and never reads the `ClusterOptions.DownIfAlone` property, so +that documented, bound option is dead. The remaining three are Low: `:F0` rounding +of cluster timing values silently degrades any sub-second configuration (Host-013), +Serilog sink setup is hard-coded in `Program.cs` rather than configuration-driven as +REQ-HOST-8 requires (Host-014), and `StartupRetry` retries indiscriminately on every +exception type including permanent schema-validation failures (Host-015). + ## Checklist coverage | # | Category | Examined | Notes | |---|----------|----------|-------| -| 1 | Correctness & logic bugs | ☑ | `/health/ready` includes the leader-only check (Host-001); site seed-node config points at the gRPC port (Host-004). | -| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. HOCON built by raw string interpolation (Host-006); `StartAsync` returns before actors are confirmed running (Host-009). | -| 3 | Concurrency & thread safety | ☑ | Blocking `GetAwaiter().GetResult()` on a hosted-service startup thread (Host-005). `DeadLetterMonitorActor` state is actor-confined — no issues. | -| 4 | Error handling & resilience | ☑ | Top-level try/catch logs fatal and rethrows. No retry around DB migration / readiness preconditions (Host-010). | -| 5 | Security | ☑ | Plaintext DB password, LDAP service-account password and dev JWT key checked into `appsettings.Central.json` (Host-003). | -| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. | -| 7 | Design-document adherence | ☑ | REQ-HOST-6 mandates Akka.Persistence config but none exists and no persistent actors exist — doc is stale (Host-002). REQ-HOST-4 GrpcPort-≠-RemotingPort rule not enforced (Host-007). | -| 8 | Code organization & conventions | ☑ | `MachineDataDb` validated/declared but never consumed (Host-008). `LoggingOptions.MinimumLevel` is dead (Host-011). | -| 9 | Testing coverage | ☑ | Strong suite; no test asserts `/health/ready` excludes `active-node`, which is why Host-001 slipped through (noted in Host-001). | -| 10 | Documentation & comments | ☑ | Comments are accurate. REQ-HOST-6 in the design doc is the main stale-doc item (Host-002). | +| 1 | Correctness & logic bugs | ☑ | Host-001/004 resolved. Re-review: `:F0` rounding of cluster timing values silently distorts sub-second durations (Host-013). | +| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. Host-006/009 resolved; no new issues. | +| 3 | Concurrency & thread safety | ☑ | Host-005 resolved (`StartAsync` now genuinely async). `DeadLetterMonitorActor` state is actor-confined — no issues. | +| 4 | Error handling & resilience | ☑ | Host-010 resolved. Re-review: `StartupRetry` retries indiscriminately on permanent failures (e.g. schema-validation mismatch) (Host-015). | +| 5 | Security | ☑ | Host-003 resolved — secrets externalised to env vars. No new issues. | +| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. No new issues. | +| 7 | Design-document adherence | ☑ | Host-002/007 resolved. Re-review: `ClusterOptions.DownIfAlone` bound/documented but never consumed — HOCON hard-codes `on` (Host-012); Serilog sinks hard-coded, not config-driven per REQ-HOST-8 (Host-014). | +| 8 | Code organization & conventions | ☑ | Host-008/011 resolved. No new issues. | +| 9 | Testing coverage | ☑ | Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new `down-if-alone`, sub-second-duration, or non-transient-retry paths (Host-012/013/015). | +| 10 | Documentation & comments | ☑ | REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014). | ## Findings @@ -561,3 +578,153 @@ Regression tests in new `LoggerConfigurationTests`: — they assert the configured level actually filters log events; they pass against the new factory (and would fail against the old inline configuration, which ignored the key). Full Host suite green (175 passed). + +### Host-012 — `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:146-148` | + +**Description** + +`AkkaHostedService.BuildHocon` emits the split-brain-resolver block with +`keep-oldest { down-if-alone = on }` as a literal constant. `ClusterOptions` +(`src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:74`) exposes a +`DownIfAlone` property — bound from `ScadaLink:Cluster` via the Options pattern, +documented as "the design-doc requirement", default `true` — but a repo-wide search +shows it is referenced **nowhere outside its own declaration**. The Host therefore +ignores the bound value entirely: setting `ScadaLink:Cluster:DownIfAlone` to `false` +in `appsettings.json` has no effect, the resolver still runs with `down-if-alone = +on`. This is the same class of defect as the resolved Host-011 +(`LoggingOptions.MinimumLevel` was dead config) — a configuration option that is +declared, bound, and documented but never consumed, which silently misleads any +operator who edits it. A batch-1 re-review of ClusterInfrastructure flagged the same +hard-coding; it is recorded here because `BuildHocon` is Host-owned code (the +ClusterInfrastructure project owns only the configuration contract, per the +`ClusterOptions` class comment). + +**Recommendation** + +Make `BuildHocon` consume `clusterOptions.DownIfAlone` — emit `down-if-alone = +{(clusterOptions.DownIfAlone ? "on" : "off")}` (the value is a bool, so no escaping +is needed). Add a `HoconBuilderTests` case asserting both `true` and `false` produce +the corresponding `down-if-alone` token. If the flag is genuinely meant to be fixed +at `on` for ScadaLink's two-node clusters, remove the `DownIfAlone` property and its +doc comment instead so code and configuration contract agree — but do not leave it +declared-but-dead. + +**Resolution** + +_Unresolved._ + +### Host-013 — `:F0` rounding of cluster timing values silently degrades sub-second configuration + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:135-136,145,151-152` | + +**Description** + +`BuildHocon` renders every duration into HOCON via `TotalSeconds:F0` — +`transportHeartbeatSec:F0`, `transportFailureSec:F0`, `StableAfter.TotalSeconds:F0`, +`HeartbeatInterval.TotalSeconds:F0`, `FailureDetectionThreshold.TotalSeconds:F0`. The +`F0` format specifier rounds to whole seconds, so any sub-second configuration is +silently distorted: a `HeartbeatInterval` of `00:00:00.5` renders as +`heartbeat-interval = 0s` (a degenerate / invalid value that Akka will reject or +treat as zero), and `00:00:02.6` becomes `3s`. The shipped defaults are all whole +seconds so this is latent, but the configuration model accepts arbitrary `TimeSpan` +values and an operator tuning a heartbeat to e.g. `750ms` would get a `1s` value with +no warning — or, worse, `0s`. Rounding configured timing values without surfacing +the change is a correctness hazard for exactly the kind of failure-detection tuning +these options exist for. + +**Recommendation** + +Render durations without precision loss — emit milliseconds (e.g. +`heartbeat-interval = {ts.TotalMilliseconds:F0}ms`) so sub-second values survive, or +validate in `StartupValidator` that each cluster timing value is a positive whole +number of seconds and fail fast otherwise. Either way, do not silently round. + +**Resolution** + +_Unresolved._ + +### Host-014 — Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8) + +| | | +|--|--| +| Severity | Low | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.Host/Program.cs:43-48`, `src/ScadaLink.Host/appsettings.json:1-7` | + +**Description** + +REQ-HOST-8 requires the Host to configure Serilog with "Configuration-driven sink +setup (console and file sinks at minimum)". `LoggerConfigurationFactory.Build` calls +`.ReadFrom.Configuration(configuration)`, which reads the standard `Serilog` +configuration section — but neither `appsettings.json` nor either role-specific file +contains a `Serilog` section (only a Microsoft `Logging` section with a `LogLevel` +map, which Serilog's `ReadFrom.Configuration` does not consume). The two sinks that +actually run are appended in `Program.cs` as hard-coded `.WriteTo.Console(...)` and +`.WriteTo.File("logs/scadalink-.log", rollingInterval: Day)` calls. As a result the +console output template, the file path, and the rolling interval cannot be changed +without recompiling — an operator cannot redirect logs, change the file location, or +add a sink via configuration, contrary to REQ-HOST-8. The `ReadFrom.Configuration` +call is effectively a no-op because the section it reads does not exist. + +**Recommendation** + +Move the console and file sink definitions into a `Serilog` section in +`appsettings.json` (with `WriteTo` entries and the output template / file path / +rolling interval as arguments) so `ReadFrom.Configuration` drives them, and drop the +hard-coded `.WriteTo` calls from `Program.cs`. Alternatively, update REQ-HOST-8 to +state the sinks are intentionally code-defined — but the design doc currently says +"configuration-driven", so code and doc must be reconciled. + +**Resolution** + +_Unresolved._ + +### Host-015 — `StartupRetry` retries on every exception type, including permanent failures + +| | | +|--|--| +| Severity | Low | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.Host/StartupRetry.cs:36-45` | + +**Description** + +`StartupRetry.ExecuteWithRetryAsync` catches `Exception` with only the guard +`when (attempt < maxAttempts)` — it retries on *any* exception type. Its sole caller +wraps `MigrationHelper.ApplyOrValidateMigrationsAsync` (`Program.cs:124-134`), which +on a Central node in production *validates* the schema version and throws if the +deployed schema does not match the expected migration set. A schema-version mismatch +is a permanent error — retrying it cannot succeed — yet `StartupRetry` will retry it +8 times with capped exponential backoff (2s, 4s, 8s, 16s, 30s, 30s, 30s ≈ 2 minutes) +before finally rethrowing, delaying the fatal-exit-and-alert by minutes for a fault +that is fatal on the first attempt. The retry helper is meant for *transient* +connection faults (the XML doc says exactly that: "the database may be briefly +unreachable"), but it cannot distinguish transient from permanent failures. + +**Recommendation** + +Restrict retries to transient faults — e.g. accept an `Func +isTransient` predicate and, for the migration call site, treat only +connection-class exceptions (`SqlException` with a connection/transport error +number, `TimeoutException`, socket errors) as retryable while letting +schema-validation / `InvalidOperationException` failures propagate immediately. Add +a `StartupRetryTests` case asserting a non-transient exception is rethrown after a +single attempt. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/InboundAPI/findings.md b/code-reviews/InboundAPI/findings.md index 1109a47..d8edbd6 100644 --- a/code-reviews/InboundAPI/findings.md +++ b/code-reviews/InboundAPI/findings.md @@ -5,10 +5,10 @@ | Module | `src/ScadaLink.InboundAPI` | | Design doc | `docs/requirements/Component-InboundAPI.md` | | Status | Reviewed | -| Last reviewed | 2026-05-16 | +| Last reviewed | 2026-05-17 | | Reviewer | claude-agent | -| Commit reviewed | `9c60592` | -| Open findings | 0 | +| Commit reviewed | `39d737e` | +| Open findings | 4 | ## Summary @@ -30,6 +30,25 @@ well but there is no coverage of the HTTP endpoint, concurrency, or recompilatio None of the findings are data-loss-class, but the concurrency and trust-model issues are High severity and should be addressed before production use. +#### Re-review 2026-05-17 (commit `39d737e`) + +All 13 findings from the initial review remain `Resolved`; the module source under +`src/ScadaLink.InboundAPI` is unchanged since the last InboundAPI fix commit +(`8dd7412`), which precedes `39d737e`. This re-review re-walked all 10 checklist +categories against the resolved code and surfaced **4 new findings** — none touching +the previously-fixed concurrency/trust-model code, but all in areas the first pass +did not probe deeply: (1) the `ReturnDefinition` column is loaded onto `ApiMethod` +but is never consulted — script return values are serialized verbatim with no shaping +or validation against the declared return structure (InboundAPI-014); (2) the new +`ForbiddenApiChecker` is a purely textual syntax walker and can be bypassed by +reaching forbidden functionality through member access that never spells a forbidden +namespace, e.g. `typeof(x).Assembly.GetType("System.IO.File")` (InboundAPI-015); +(3) routed `Route.To().Call()` invocations are not bound by the method timeout unless +the script explicitly threads `Parameters`-side cancellation, contradicting the design +statement that the timeout covers routed calls (InboundAPI-016); and (4) `RouteHelper` +/ `RouteTarget` — the entire WP-4 cross-site routing surface — has no test coverage +(InboundAPI-017). New findings are one Medium-trio plus one Low; no Critical or High. + ## Checklist coverage | # | Category | Examined | Notes | @@ -38,11 +57,11 @@ are High severity and should be addressed before production use. | 2 | Akka.NET conventions | ☑ | Module is ASP.NET-hosted, no actors of its own; routes to actors via `CommunicationService`. No correlation-ID issues — IDs are set in `RouteHelper`. | | 3 | Concurrency & thread safety | ☑ | Singleton `InboundScriptExecutor` mutates a non-thread-safe `Dictionary` from concurrent request threads — see InboundAPI-001/002. | | 4 | Error handling & resilience | ☑ | Catch-all conflates client cancellation with timeout (InboundAPI-004); compilation-failure path repeats work on every request (InboundAPI-009). | -| 5 | Security | ☑ | Non-constant-time key comparison, no trust-model enforcement, no body-size limit, missing-method enumeration oracle — see InboundAPI-003/005/006/011. | +| 5 | Security | ☑ | Prior items resolved. Re-review: `ForbiddenApiChecker` is a textual deny-list bypassable via reflection without a forbidden namespace token (InboundAPI-015). | | 6 | Performance & resource management | ☑ | Up to 3 separate DB round-trips per request in `ApiKeyValidator`; uncapped lazy recompilation. | -| 7 | Design-document adherence | ☑ | `Database.Connection()` script API missing; central-only hosting not enforced; lazy-compile diverges from "compiled at startup". | -| 8 | Code organization & conventions | ☑ | `ParameterDefinition` is an API-shaped POCO declared in the component project rather than Commons; otherwise conventions followed. | -| 9 | Testing coverage | ☑ | Good unit coverage of the two validators; no endpoint, concurrency, recompilation, or timeout-vs-cancel tests. | +| 7 | Design-document adherence | ☑ | Re-review: `ReturnDefinition` loaded but never used (InboundAPI-014); routed-call timeout not enforced (InboundAPI-016). Prior `Database.Connection()`/central-only items resolved. | +| 8 | Code organization & conventions | ☑ | `ParameterDefinition` moved to Commons (InboundAPI-012 resolved); no new issues. | +| 9 | Testing coverage | ☑ | Re-review: `RouteHelper`/`RouteTarget` (WP-4 routing) entirely untested (InboundAPI-017); validators/executor/filter well covered. | | 10 | Documentation & comments | ☑ | `ApiKeyValidationResult.NotFound` XML/name says "NotFound" but returns HTTP 400 — misleading (InboundAPI-013). | ## Findings @@ -580,3 +599,181 @@ the new method-not-found status, and removing dead code cannot regress. Doc-owne follow-up: `Component-InboundAPI.md`'s Error Handling section still does not list a "method not found" status; it should note that it is reported as 403 (indistinguishable from "key not approved"), but that doc edit is outside this module's editable scope. + +### InboundAPI-014 — `ReturnDefinition` is loaded but never used; script return value is unshaped/unvalidated + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:201-205`, `src/ScadaLink.Commons/Entities/InboundApi/ApiMethod.cs:10` | + +**Description** + +`Component-InboundAPI.md` ("API Method Definition → Return Value Definition" and the +"Response Format" section) specifies that each method has a declared return structure +— "Field names and data types … Supports returning lists of objects" — and that the +success response body is "the method's return value as JSON, with fields matching the +return value definition". The `ApiMethod` entity carries a `ReturnDefinition` column +to hold exactly this. However, nothing in the module ever reads `ReturnDefinition`: +`ExecuteAsync` takes whatever object the script happens to return and does a blind +`JsonSerializer.Serialize(result)`. There is no validation that the script's return +value matches the declared shape, no coercion to the declared field types, and no +error when a method returns a structure inconsistent with its definition. A method +whose script returns the wrong shape (or `null` where a structure is required) will +silently emit a malformed 200 response, and the documented return-definition contract +is effectively unenforced. This is the response-side mirror of the parameter +validation that `ParameterValidator` does perform, leaving the two halves of the +method contract asymmetric. + +**Recommendation** + +Either (a) implement return-value validation/shaping: parse `ReturnDefinition` with +the same extended-type machinery used for parameters and validate/coerce the script +result before serializing, returning a 500 (or logging) when the script result does +not match; or (b) if return shaping is deliberately out of scope, remove the "Return +Value Definition" / "fields matching the return value definition" language from +`Component-InboundAPI.md` and document that the response is the script's raw return +value serialized as-is. Code and design doc must be reconciled. + +**Resolution** + +_Unresolved._ + +### InboundAPI-015 — `ForbiddenApiChecker` is purely textual and is bypassable via reflection reachable without a forbidden namespace token + +| | | +|--|--| +| Severity | Medium | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.InboundAPI/ForbiddenApiChecker.cs:63-119`, `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:109-126` | + +**Description** + +`ForbiddenApiChecker` walks the script syntax tree and rejects any `using` directive, +`QualifiedNameSyntax`, or `MemberAccessExpressionSyntax` whose textual dotted name +starts with a forbidden namespace prefix (`System.IO`, `System.Diagnostics`, +`System.Reflection`, `System.Net`, etc.). This is a textual match, not a semantic +one, and the trust model it enforces (per InboundAPI-005) is explicitly meant to keep +*untrusted* Design-role scripts away from host APIs. The check can be bypassed because +forbidden functionality is reachable through member access that never spells a +forbidden namespace: + +- `typeof(string).Assembly.GetType("System.IO.File")` — `typeof(string)` is permitted, + `.Assembly` is a `System.Type` property, `.GetType(string)` is a `System.Reflection.Assembly` + method. The string literal `"System.IO.File"` is a string, not a `QualifiedNameSyntax` + or `MemberAccessExpressionSyntax`, so `IsForbidden` never sees it. The script obtains + a `System.IO.File` `Type` and can `InvokeMember`/`GetMethod(...).Invoke(...)` on it — + all via members of permitted types — with no forbidden namespace ever appearing in + the source. `CompileAndRegister` references `typeof(object).Assembly` + (System.Private.CoreLib) in `ScriptOptions`, so every framework type is loadable at + runtime. +- The executor also references the `Microsoft.CSharp.RuntimeBinder` assembly + (`InboundScriptExecutor.cs:116`), enabling the `dynamic` keyword, which further + widens late-bound member access that the static walker cannot see through. + +Because the inbound API script runs on the central node with the host process's +privileges and is authored by the (less-trusted-than-Admin) Design role, a static +textual deny-list gives a false sense of containment. + +**Recommendation** + +Treat the syntax walker as defence-in-depth, not the boundary. Strengthen it where +cheap (flag `Assembly.GetType`, `Type.GetType`, `Activator.CreateInstance`, +`InvokeMember`, and `dynamic` usage), but for real enforcement run compiled scripts +under a genuine boundary — a restricted `AssemblyLoadContext`/AppDomain-equivalent, a +curated reference set that does not expose reflection-to-arbitrary-type, or an +out-of-process sandbox — consistent with however the Site Runtime ultimately enforces +its instance-script trust model. At minimum, document in `Component-InboundAPI.md` +that the current check is best-effort and does not stop a determined script. + +**Resolution** + +_Unresolved._ + +### InboundAPI-016 — Routed `Route.To().Call()` invocations are not bound by the method timeout + +| | | +|--|--| +| Severity | Medium | +| Category | Design-document adherence | +| Status | Open | +| Location | `src/ScadaLink.InboundAPI/RouteHelper.cs:59-152`, `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:177`, `:199` | + +**Description** + +`Component-InboundAPI.md` states the per-method timeout "defines the maximum time the +method is allowed to execute (**including any routed calls to sites**)", and the +Routing Behavior section says a routed call "blocks until the site responds or the +**method-level timeout** is reached". The executor builds a linked +`CancellationTokenSource` (`cts`) combining the request-abort token and a dedicated +timeout CTS, and exposes `cts.Token` to the script as `InboundScriptContext.CancellationToken`. +However, every `RouteTarget` method (`Call`, `GetAttribute(s)`, `SetAttribute(s)`) +takes `CancellationToken cancellationToken = default` and the script must *explicitly* +pass the context token for the routed call to honour the timeout. A natural script — +`Route.To("inst").Call("doWork", parameters)` — invokes the routed call with +`CancellationToken.None`. That request flows into `CommunicationService.RouteToCallAsync` +with no cancellation, so the routed call is not bounded by the method timeout at all. +The only timeout guard left is `handler(context).WaitAsync(cts.Token)` in +`ExecuteAsync`: when the method timeout fires, `WaitAsync` returns a cancellation to +the caller, but the underlying script `Task` — and the in-flight `RouteToCallAsync` +awaiting a remote site — keeps running orphaned with no cancellation, holding the +correlation/communication resources until the site eventually responds or its own +transport timeout (if any) fires. The design's guarantee that the method timeout +covers routed calls is therefore not met, and a slow/hung site can leak background +work past the timeout the caller was told bounds the request. + +**Recommendation** + +Make routed calls inherit the method deadline without relying on script discipline: +have `RouteHelper`/`RouteTarget` carry the executing method's `CancellationToken` +(injected by `InboundScriptExecutor` when it constructs the context, e.g. a +`RouteHelper` bound to `cts.Token`) and pass it into every `CommunicationService` +call by default, so `Route.To("x").Call("s", p)` is timeout-bounded with no token +argument. Keep the explicit-token overload for callers that want a tighter bound. +Verify `RouteToCallAsync` and the attribute-routing calls actually observe the token +and abandon the in-flight request when it fires. + +**Resolution** + +_Unresolved._ + +### InboundAPI-017 — `RouteHelper` / `RouteTarget` has no test coverage + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Open | +| Location | `src/ScadaLink.InboundAPI/RouteHelper.cs:1-165`, `tests/ScadaLink.InboundAPI.Tests/` | + +**Description** + +`RouteHelper`/`RouteTarget` is the entire WP-4 cross-site routing surface — the +`Route.To().Call()/GetAttribute(s)/SetAttribute(s)` API that inbound API scripts use +to reach instances at any site. It has zero tests: the `ScadaLink.InboundAPI.Tests` +project covers `ApiKeyValidator`, `ParameterValidator`, `InboundScriptExecutor`, and +`InboundApiEndpointFilter`, but no test file exercises `RouteHelper`. Untested +behaviours include site resolution via `IInstanceLocator` (including the +"instance not found / no assigned site" `InvalidOperationException` path at +`RouteHelper.cs:154-164`), the `!response.Success` → `InvalidOperationException` +translation in each routed method, `GetAttribute` delegating to the batch +`GetAttributes` and returning `null` for an absent key, correlation-ID generation, +and `SetAttribute` delegating to `SetAttributes`. These are non-trivial branches +whose failure modes (a thrown exception inside a script) surface to the caller as a +500, so regressions would be silent. + +**Recommendation** + +Add a `RouteHelperTests` suite using substituted `IInstanceLocator` and +`CommunicationService` (the executor tests already substitute `CommunicationService`): +cover the happy path of each routed method, the unresolved-instance throw, the +`!Success` → `InvalidOperationException` mapping, and `GetAttribute` returning `null` +for a missing key. This also gives InboundAPI-016 a regression home if the timeout +wiring is added. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/ManagementService/findings.md b/code-reviews/ManagementService/findings.md index 9645321..687cf5c 100644 --- a/code-reviews/ManagementService/findings.md +++ b/code-reviews/ManagementService/findings.md @@ -5,10 +5,10 @@ | Module | `src/ScadaLink.ManagementService` | | Design doc | `docs/requirements/Component-ManagementService.md` | | Status | Reviewed | -| Last reviewed | 2026-05-16 | +| Last reviewed | 2026-05-17 | | Reviewer | claude-agent | -| Commit reviewed | `9c60592` | -| Open findings | 0 (1 Deferred — see ManagementService-012) | +| Commit reviewed | `39d737e` | +| Open findings | 4 (1 Deferred — see ManagementService-012) | ## Summary @@ -28,6 +28,24 @@ instances never disposed) and dead/unused configuration. None of the findings ar crash-class, but the site-scope gaps are High severity because they are a real authorization bypass with no workaround. +#### Re-review 2026-05-17 (commit `39d737e`) + +All thirteen prior findings remain correctly closed; the source under +`src/ScadaLink.ManagementService` is byte-identical between the previously reviewed state +and `39d737e` (the resolution commits of findings 001–013 are folded into the history at or +before `39d737e`). ManagementService-012 was re-checked and its **Deferred** status still +holds: `ManagementEnvelope.Command` is still typed `object`, and the marker-interface fix +still belongs in the Commons module, outside this module's edit scope — nothing has changed +to make it actionable here. This re-review re-ran the full 10-category checklist against the +current sources and surfaced **four new findings**. The dominant theme is the same +site-scope authorization gap that findings 001/002 closed: `HandleQueryDeployments` +(`QueryDeploymentsCommand`) was overlooked by that sweep and still performs no site-scope +enforcement, letting a site-scoped Deployment user read deployment history for any site +(014, High). The remaining three are lower severity: a non-atomic multi-override mutation +that can leave an instance partially modified after an error (015, Medium), raw exception +messages from unexpected faults being returned verbatim to HTTP callers (016, Low), and +`QueryDeploymentsCommand` having no test coverage at all (017, Low). + ## Checklist coverage | # | Category | Examined | Notes | @@ -551,3 +569,144 @@ infrastructure, so the testable command-parsing/dispatch logic was extracted int `ParseCommand` helper and covered instead. Tests: `ParseCommand_*` (5), `SerializeResult_*` (2), `UnknownCommandType_FaultMappedToManagementError`, plus the pre-existing site-scope and DebugStreamHub suites. `dotnet test` -> 48 passed. + +### ManagementService-014 — HandleQueryDeployments bypasses site-scope enforcement + +| | | +|--|--| +| Severity | High | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:306`, `:1174` | + +**Description** + +`QueryDeploymentsCommand` is gated to the `Deployment` role by `GetRequiredRole` +(`ManagementActor.cs:170`–`:177`), and the design document's Authorization section states +"Site scoping is enforced for site-scoped Deployment users" and explicitly lists deployments +among the Deployment-role operations. `HandleQueryDeployments` makes no call to +`EnforceSiteScope` / `EnforceSiteScopeForInstance` / `EnforceSiteScopeForIdentifier`: with no +`InstanceId` it returns `IDeploymentManagerRepository.GetAllDeploymentRecordsAsync()` (every +deployment record across all sites), and with an `InstanceId` it returns that instance's +deployment history with no check that the instance's site is within the caller's permitted +set. A site-scoped Deployment user scoped to site A can therefore enumerate deployment +records for instances at site B — instance IDs, `DeployedBy` (operator usernames), +timestamps, deployment status, and `ErrorMessage` content — by issuing `QueryDeployments` +with or without an out-of-scope `InstanceId`. This is the same authorization-bypass class as +the resolved findings 001/002, on a handler that sweep did not cover; it is `DispatchCommand`'s +only `Deployment`-role handler with no scope enforcement. + +**Recommendation** + +Thread `AuthenticatedUser` into `HandleQueryDeployments` (the dispatch case at line 306 +already has `user` in scope). When `cmd.InstanceId` is supplied, call +`EnforceSiteScopeForInstance` before querying. When it is not supplied, filter the returned +`DeploymentRecord` list to the caller's permitted sites — resolve each record's instance to +its `SiteId` (or join through a site-aware repository query) and drop records for sites +outside `PermittedSiteIds`, mirroring the `HandleListInstances` / `HandleListSites` filter +pattern. Add a regression test for a site-scoped user against in-scope and out-of-scope +instances. + +**Resolution** + +_Unresolved._ + +### ManagementService-015 — HandleSetInstanceOverrides applies overrides non-atomically + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:647`–`:659` | + +**Description** + +`HandleSetInstanceOverrides` iterates `cmd.Overrides` and calls +`InstanceService.SetAttributeOverrideAsync` once per attribute, throwing +`InvalidOperationException` on the first `result.IsSuccess == false`. Each +`SetAttributeOverrideAsync` call persists independently, so if the command supplies five +overrides and the third fails (e.g. an unknown attribute name or a validation error), the +first two overrides are already committed to the configuration database while the caller +receives a `ManagementError`. The instance is left partially mutated in a state the operator +neither sees nor requested, and the per-instance operation lock referenced in the design's +deployment decisions does not protect against this because the partial writes are committed +before the throw. A retry of the same command then re-applies the already-applied overrides. + +**Recommendation** + +Make the multi-override mutation all-or-nothing: either validate every requested override up +front before applying any, or apply all overrides within a single transaction / unit-of-work +so a mid-batch failure rolls back the earlier writes. If `InstanceService` cannot offer a +batch method, at minimum document the partial-application behaviour on `SetInstanceOverridesCommand` +and have the handler report which overrides were applied before the failure so the caller +can reconcile. + +**Resolution** + +_Unresolved._ + +### ManagementService-016 — Unexpected exception messages returned verbatim to HTTP callers + +| | | +|--|--| +| Severity | Low | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:121`–`:131` | + +**Description** + +`MapFault` maps any non-`SiteScopeViolationException` fault to +`new ManagementError(correlationId, cause.Message, "COMMAND_FAILED")`, and +`ManagementEndpoints.HandleRequest` returns that `Error` string directly in the HTTP 400 +body. For handler-thrown `InvalidOperationException`s carrying a curated `result.Error` +message this is intended and safe. But the same path also surfaces the raw `.Message` of +unanticipated exceptions — a `SqlException` (which can include server/database names and +constraint details), a `DbUpdateException`, an `ArgumentException` from `Enum.Parse` on a +malformed `DataType`/`TriggerType` value, or a `NullReferenceException` — straight to the +external CLI/HTTP client. This is a minor internal-detail disclosure surface: the exception +text is already logged server-side with full context, so the client does not need the raw +message. + +**Recommendation** + +Distinguish handler-curated failures from unexpected faults. Have handlers throw a dedicated +exception type (e.g. `ManagementCommandException`) for messages that are safe to surface, and +in `MapFault` return that message for the known type while returning a generic +"An internal error occurred (CorrelationId=...)" string for everything else — the operator +can still correlate to the server log via the correlation ID. + +**Resolution** + +_Unresolved._ + +### ManagementService-017 — QueryDeploymentsCommand has no test coverage + +| | | +|--|--| +| Severity | Low | +| Category | Testing coverage | +| Status | Open | +| Location | `tests/ScadaLink.ManagementService.Tests/ManagementActorTests.cs:1` | + +**Description** + +`QueryDeploymentsCommand` / `HandleQueryDeployments` is exercised by no test in +`ManagementActorTests`. There is no test that it requires the `Deployment` role, no test of +the `InstanceId`-filtered versus unfiltered branches, and — because the handler performs no +site-scope enforcement at all — no test that would have caught finding 014. The deployment +query is one of the operations the design's Authorization section calls out for site +scoping, yet it is the only `Deployment`-role command with neither an authorization test nor +a site-scope test. + +**Recommendation** + +Add tests for `QueryDeploymentsCommand`: a role test (Design/no-role caller → +`ManagementUnauthorized`), branch coverage for the `InstanceId`-filtered and unfiltered +repository calls, and — once finding 014 is fixed — site-scope tests for a site-scoped +Deployment user against in-scope and out-of-scope deployment records. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/NotificationService/findings.md b/code-reviews/NotificationService/findings.md index e48232f..ae3b0e0 100644 --- a/code-reviews/NotificationService/findings.md +++ b/code-reviews/NotificationService/findings.md @@ -5,10 +5,10 @@ | Module | `src/ScadaLink.NotificationService` | | Design doc | `docs/requirements/Component-NotificationService.md` | | Status | Reviewed | -| Last reviewed | 2026-05-16 | +| Last reviewed | 2026-05-17 | | Reviewer | claude-agent | -| Commit reviewed | `9c60592` | -| Open findings | 0 | +| Commit reviewed | `39d737e` | +| Open findings | 5 | ## Summary @@ -30,6 +30,31 @@ plaintext `string` values. Test coverage exercises the happy path and the main error branches but misses the OAuth2 delivery path, the permanent-classification fallback in `DeliverAsync`, and concurrency on the token cache. +#### Re-review 2026-05-17 (commit `39d737e`) + +Re-reviewed at commit `39d737e` after the NS-001..013 fixes. All twelve prior +findings remain closed (eleven Resolved, NS-013 Deferred) and the fixes hold up +against the current source — error classification is now typed, the SMTP client is +created once and always disconnected/disposed, the S&F `Notification` delivery +handler is registered in `AkkaHostedService`, and the OAuth2 token cache is keyed +per credential. The re-review surfaced **five new findings**. The recurring theme is +**unclassified-exception handling**: the typed `ClassifySmtpError` helper introduced +by NS-002/003 returns `Unknown` for anything that is not a recognised SMTP/socket +failure, but neither `SendAsync` nor `DeliverBufferedAsync` has a catch-all for the +`Unknown` bucket. As a result an OAuth2 token-fetch failure (`HttpRequestException` +from `EnsureSuccessStatusCode`, or `InvalidOperationException` from a malformed +credential string) escapes `SendAsync` as a raw exception to the calling script +(NS-015) and escapes `DeliverBufferedAsync` out of the S&F handler, where the engine +treats any thrown exception as transient and retries a permanently-broken config +forever (NS-014) — the same defect class NS-008 fixed for address parsing, but the +OAuth2 path was never covered. Secondary findings: `AuthenticateAsync` silently +proceeds unauthenticated for an unknown `authType` or empty credentials (NS-016); +`NotificationOptions` is bound from configuration in two places but never read by +any code (NS-017, dead config — NS-007 sourced the timeout/limit from +`SmtpConfiguration` instead); and the lazily-created concurrency limiter is read +outside its lock, is sized once and never resized on redeployment, and is never +disposed (NS-018). + ## Checklist coverage | # | Category | Examined | Notes | @@ -465,3 +490,108 @@ filed) and the genuinely-missing tests were added: `AkkaHostedService` handler registration), so no further test is required here. Module test suite is green at 56 tests. + +### NotificationService-014 — OAuth2 token-fetch failure escapes `DeliverBufferedAsync`; a permanently-broken config is retried forever + +| | | +|--|--| +| Severity | High | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:214-228`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:308-312`, `src/ScadaLink.NotificationService/OAuth2TokenService.cs:56-84` | + +**Description** + +`DeliverBufferedAsync` is the registered Store-and-Forward delivery handler for `StoreAndForwardCategory.Notification`. Per the S&F handler contract (`StoreAndForwardService.cs:40-53`), a handler must return `true` for success, `false` for permanent failure (park, no further retries), and **throw only for a transient failure** (retry). `DeliverBufferedAsync`'s `try` block (`:214-228`) catches only `SmtpPermanentException`; the comment at `:227` asserts "Transient SMTP errors propagate out of `DeliverAsync`". But `DeliverAsync` can also throw exceptions that `ClassifySmtpError` buckets as `Unknown` — most importantly the OAuth2 path at `:308-312`, where `OAuth2TokenService.GetTokenAsync` throws `HttpRequestException` (from `EnsureSuccessStatusCode` at `OAuth2TokenService.cs:78`) or `InvalidOperationException` (malformed credential triple, or missing `access_token`). None of these is an `SmtpCommandException`/`SmtpProtocolException`/`SocketException`/`TimeoutException`, so they are neither caught nor classified — they propagate straight out of `DeliverBufferedAsync`. The S&F engine treats *any* thrown exception as a transient failure, so a notification whose SMTP config has an invalid client secret (a 401 → `HttpRequestException`, a genuinely **permanent** condition) is retried on every sweep until `MaxRetries` is reached — burning OAuth2 token-endpoint calls and never parking promptly. A malformed credential string (`InvalidOperationException`, never fixable) is likewise retried instead of parked immediately. + +**Recommendation** + +Add a catch-all to `DeliverBufferedAsync` for exceptions that `ClassifySmtpError` classifies as `Unknown`: log and return `false` (park) for non-retryable causes such as `InvalidOperationException` from credential parsing, and decide deliberately whether an OAuth2 `HttpRequestException` should park or retry (a 5xx token endpoint is transient; a 401 is permanent — inspect `HttpRequestException.StatusCode`). Do not let an unclassified exception leave the handler and be silently reinterpreted as transient. + +**Resolution** + +_Unresolved._ + +### NotificationService-015 — Unclassified exceptions (OAuth2 token fetch, non-cancellation OCE) escape `SendAsync` to the calling script + +| | | +|--|--| +| Severity | High | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:96-148`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:308-312` | + +**Description** + +`SendAsync`'s `try`/`catch` around `DeliverAsync` has exactly three catch clauses: `SmtpPermanentException` (`:101`), `OperationCanceledException` when cancellation was requested (`:111`), and `Exception` filtered by `IsTransientSmtpError` (`:116`). There is no catch-all for an exception that is none of those. `ClassifySmtpError` returns `SmtpErrorClass.Unknown` for anything it does not recognise, and `IsTransientSmtpError` only matches `Transient`, so an `Unknown`-classified exception falls through all three clauses and escapes `SendAsync` unhandled. Concretely, the OAuth2 branch at `:308-312` calls `OAuth2TokenService.GetTokenAsync`, which throws `HttpRequestException` (token endpoint unreachable / 4xx / 5xx) or `InvalidOperationException` (malformed `tenant:client:secret` triple, or no `access_token` in the response). Both reach the calling script (an instance/alarm/shared script) as a raw exception of a type the `INotificationDeliveryService` contract never advertises — the design doc says delivery returns either success, a buffered transient result, or a permanent `NotificationResult` error. `Notify.To().Send()` against an OAuth2 config with a bad secret therefore crashes the script instead of returning a clean `NotificationResult(false, ...)`. This is the same defect class NS-008 resolved for malformed addresses, but NS-008's `ValidateAddresses` fix never covered the OAuth2 token path. (A non-cancellation `OperationCanceledException` — e.g. an internal MailKit timeout surfaced as OCE while the caller's token is *not* cancelled — would escape the same way.) + +**Recommendation** + +Add a final `catch (Exception ex)` to `SendAsync` that converts any otherwise-unhandled exception into a permanent `NotificationResult(false, ...)` (credential-scrubbed, consistent with NS-009), and log it. Treating an unclassified failure as permanent is the safe default — it returns control to the script rather than crashing it; an OAuth2 transient (token endpoint 5xx) being reported as permanent is an acceptable trade against an unhandled crash, and can be refined by inspecting `HttpRequestException.StatusCode` if transient buffering of token failures is wanted. + +**Resolution** + +_Unresolved._ + +### NotificationService-016 — `AuthenticateAsync` silently sends unauthenticated for an unknown auth type or empty credentials + +| | | +|--|--| +| Severity | Medium | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:46-67` | + +**Description** + +`MailKitSmtpClientWrapper.AuthenticateAsync` returns without authenticating in two silent paths. (1) `if (string.IsNullOrEmpty(credentials)) return;` (`:48-49`) — if an SMTP config has `AuthType` "basic" or "oauth2" but a null/empty `Credentials` value (a misconfigured or partially-deployed row), the method returns and `DeliverAsync` proceeds straight to `SendAsync` on an *unauthenticated* connection. (2) The `switch (authType.ToLowerInvariant())` (`:51-66`) has cases only for `"basic"` and `"oauth2"` and **no `default`** — any other `AuthType` value (a typo, a future "ntlm", an empty string) falls through with no authentication attempted. Additionally, in the `"basic"` case, `credentials.Split(':', 2)` that yields fewer than 2 parts (a credential string with no colon) is silently skipped (`:55-58`). In every case the connection then attempts to send mail unauthenticated: against a relay that does not require auth this masks a misconfiguration; against a server that does require auth it fails later with a less obvious error, and at worst an unauthenticated send succeeds where it should not. Authentication being skipped should never be silent. + +**Recommendation** + +Make missing/unparseable credentials and an unrecognised `AuthType` hard errors: throw a typed exception (e.g. a permanent configuration exception so `SendAsync`/`DeliverBufferedAsync` surface it as a clean failure / park) rather than returning. Add a `default:` arm to the `switch` that throws `ArgumentException` naming the unsupported auth type, and reject a "basic" credential string that does not split into exactly two parts. + +**Resolution** + +_Unresolved._ + +### NotificationService-017 — `NotificationOptions` is bound from configuration but never read (dead config) + +| | | +|--|--| +| Severity | Low | +| Category | Code organization & conventions | +| Status | Open | +| Location | `src/ScadaLink.NotificationService/NotificationOptions.cs:1-15`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:10-11`, `src/ScadaLink.Host/SiteServiceRegistration.cs:70` | + +**Description** + +`NotificationOptions` (with `ConnectionTimeoutSeconds` and `MaxConcurrentConnections`) is bound from the `ScadaLink:Notification` configuration section in two places — `ServiceCollectionExtensions.AddNotificationService` (`AddOptions().BindConfiguration(...)`) and again in `Host/SiteServiceRegistration.cs:70` (`services.Configure(...)`). However, a repo-wide search shows no code ever injects `IOptions` or otherwise reads either property. When NS-007 enforced the connection timeout and concurrency limit, it sourced both values from the per-site `SmtpConfiguration` entity, not from `NotificationOptions` — so this options class is now entirely dead configuration. Its XML doc still claims it "provides fallback defaults and operational limits", which is misleading: nothing falls back to it. The double binding is also redundant. Dead, falsely-documented configuration invites an operator to set `ScadaLink:Notification:ConnectionTimeoutSeconds` and expect it to take effect, when it has no effect at all. + +**Recommendation** + +Either delete `NotificationOptions` and both of its registrations, or genuinely wire it in as the documented fallback (e.g. `DeliverAsync` uses `NotificationOptions` values when the `SmtpConfiguration` field is non-positive). If kept, remove the duplicate `Configure` call and correct the XML doc. The NS-007 resolution note already states the `NotificationOptions` fields "remain as operational fallback defaults" — that intent should be implemented or the class removed. + +**Resolution** + +_Unresolved._ + +### NotificationService-018 — Concurrency limiter: lock-free read of a non-volatile field, never resized on redeployment, never disposed + +| | | +|--|--| +| Severity | Low | +| Category | Concurrency & thread safety | +| Status | Open | +| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:237-255` | + +**Description** + +Three minor issues with the NS-007 concurrency limiter. (1) `GetConcurrencyLimiter` reads the plain (non-`volatile`) field `_concurrencyLimiter` at `:242` *outside* `_limiterLock` as a fast path. The field is only written inside the lock (`:252`), so under the .NET memory model a concurrent thread is not guaranteed to observe the fully-constructed `SemaphoreSlim` via the lock-free read. In practice on x86/x64 this is benign and the double-checked write inside the lock is correct, but the unsynchronized read of a reference set under a lock is the classic lazy-init memory-model pitfall — use `Volatile.Read`/`LazyInitializer`/`Lazy`. (2) The limiter is sized once from the *first* `SmtpConfiguration` ever seen and the `??=` at `:252` means it is never re-created. The design doc says one SMTP config is deployed per site, but that config can be **redeployed** with a different `MaxConcurrentConnections`; the new value is silently ignored for the lifetime of the (scoped, but potentially long-lived per request) service — and more importantly the limit is captured per `NotificationDeliveryService` instance, so behaviour depends on which scoped instance happens to create it first. (3) `SemaphoreSlim` implements `IDisposable`; `_concurrencyLimiter` is never disposed and `NotificationDeliveryService` does not implement `IDisposable`. A scoped service leaking one undisposed `SemaphoreSlim` per scope is a slow handle leak. + +**Recommendation** + +Replace the hand-rolled double-checked init with `Lazy` or `LazyInitializer.EnsureInitialized`, or at minimum make the field access `Volatile.Read`/`Volatile.Write`. Consider hoisting the limiter to a singleton keyed by site (the limit is a per-site invariant, not a per-scope one) so redeployment with a changed limit and disposal are handled in one place; if it stays on the scoped service, implement `IDisposable` to dispose it. + +**Resolution** + +_Unresolved._ diff --git a/code-reviews/README.md b/code-reviews/README.md index 78e3733..17f5d66 100644 --- a/code-reviews/README.md +++ b/code-reviews/README.md @@ -40,10 +40,10 @@ module file and counted in **Total**. | Severity | Open findings | |----------|---------------| | Critical | 0 | -| High | 5 | -| Medium | 12 | -| Low | 17 | -| **Total** | **34** | +| High | 8 | +| Medium | 20 | +| Low | 27 | +| **Total** | **55** | ## Module Status @@ -59,11 +59,11 @@ module file and counted in **Total**. | [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 | | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 | | [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/3 | 4 | 16 | -| [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 | -| [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 | -| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 | -| [NotificationService](NotificationService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 | -| [Security](Security/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 | +| [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/3 | 4 | 15 | +| [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/3/1 | 4 | 17 | +| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/2 | 4 | 17 | +| [NotificationService](NotificationService/findings.md) | 2026-05-16 | `9c60592` | 0/2/1/2 | 5 | 18 | +| [Security](Security/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/2 | 4 | 15 | | [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 | | [SiteRuntime](SiteRuntime/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 | | [StoreAndForward](StoreAndForward/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 | @@ -80,7 +80,7 @@ description, location, recommendation — lives in the module's `findings.md`. _None open._ -### High (5) +### High (8) | ID | Module | Title | |----|--------|-------| @@ -89,8 +89,11 @@ _None open._ | DataConnectionLayer-014 | [DataConnectionLayer](DataConnectionLayer/findings.md) | DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger | | DeploymentManager-015 | [DeploymentManager](DeploymentManager/findings.md) | Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates | | ExternalSystemGateway-015 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `MaxRetries == 0` is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim | +| ManagementService-014 | [ManagementService](ManagementService/findings.md) | HandleQueryDeployments bypasses site-scope enforcement | +| NotificationService-014 | [NotificationService](NotificationService/findings.md) | OAuth2 token-fetch failure escapes `DeliverBufferedAsync`; a permanently-broken config is retried forever | +| NotificationService-015 | [NotificationService](NotificationService/findings.md) | Unclassified exceptions (OAuth2 token fetch, non-cancellation OCE) escape `SendAsync` to the calling script | -### Medium (12) +### Medium (20) | ID | Module | Title | |----|--------|-------| @@ -106,8 +109,16 @@ _None open._ | DeploymentManager-016 | [DeploymentManager](DeploymentManager/findings.md) | Reconciled prior record keeps its stale `RevisionHash` | | ExternalSystemGateway-016 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `ConfigureHttpClientDefaults` applies the ESG connection cap to every `HttpClient` in the host process | | HealthMonitoring-015 | [HealthMonitoring](HealthMonitoring/findings.md) | Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt` | +| Host-012 | [Host](Host/findings.md) | `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read | +| InboundAPI-014 | [InboundAPI](InboundAPI/findings.md) | `ReturnDefinition` is loaded but never used; script return value is unshaped/unvalidated | +| InboundAPI-015 | [InboundAPI](InboundAPI/findings.md) | `ForbiddenApiChecker` is purely textual and is bypassable via reflection reachable without a forbidden namespace token | +| InboundAPI-016 | [InboundAPI](InboundAPI/findings.md) | Routed `Route.To().Call()` invocations are not bound by the method timeout | +| ManagementService-015 | [ManagementService](ManagementService/findings.md) | HandleSetInstanceOverrides applies overrides non-atomically | +| NotificationService-016 | [NotificationService](NotificationService/findings.md) | `AuthenticateAsync` silently sends unauthenticated for an unknown auth type or empty credentials | +| Security-012 | [Security](Security/findings.md) | Partial LDAP failure during login yields a roleless authenticated session | +| Security-014 | [Security](Security/findings.md) | `RefreshToken` re-issues a token without checking the idle timeout | -### Low (17) +### Low (27) | ID | Module | Title | |----|--------|-------| @@ -128,3 +139,13 @@ _None open._ | HealthMonitoring-013 | [HealthMonitoring](HealthMonitoring/findings.md) | Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout` | | HealthMonitoring-014 | [HealthMonitoring](HealthMonitoring/findings.md) | `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service | | HealthMonitoring-016 | [HealthMonitoring](HealthMonitoring/findings.md) | `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider` | +| Host-013 | [Host](Host/findings.md) | `:F0` rounding of cluster timing values silently degrades sub-second configuration | +| Host-014 | [Host](Host/findings.md) | Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8) | +| Host-015 | [Host](Host/findings.md) | `StartupRetry` retries on every exception type, including permanent failures | +| InboundAPI-017 | [InboundAPI](InboundAPI/findings.md) | `RouteHelper` / `RouteTarget` has no test coverage | +| ManagementService-016 | [ManagementService](ManagementService/findings.md) | Unexpected exception messages returned verbatim to HTTP callers | +| ManagementService-017 | [ManagementService](ManagementService/findings.md) | QueryDeploymentsCommand has no test coverage | +| NotificationService-017 | [NotificationService](NotificationService/findings.md) | `NotificationOptions` is bound from configuration but never read (dead config) | +| NotificationService-018 | [NotificationService](NotificationService/findings.md) | Concurrency limiter: lock-free read of a non-volatile field, never resized on redeployment, never disposed | +| Security-013 | [Security](Security/findings.md) | `ExtractFirstRdnValue` mis-parses group DNs containing escaped commas | +| Security-015 | [Security](Security/findings.md) | Username is not trimmed before use in the LDAP filter, fallback DN, and JWT claims | diff --git a/code-reviews/Security/findings.md b/code-reviews/Security/findings.md index 4bc5441..69ce213 100644 --- a/code-reviews/Security/findings.md +++ b/code-reviews/Security/findings.md @@ -5,10 +5,10 @@ | Module | `src/ScadaLink.Security` | | Design doc | `docs/requirements/Component-Security.md` | | Status | Reviewed | -| Last reviewed | 2026-05-16 | +| Last reviewed | 2026-05-17 | | Reviewer | claude-agent | -| Commit reviewed | `9c60592` | -| Open findings | 0 (1 deferred — Security-008) | +| Commit reviewed | `39d737e` | +| Open findings | 4 (1 deferred — Security-008) | ## Summary @@ -28,6 +28,26 @@ on every refresh, weakening the documented 30-minute idle policy. None of these crash/data-loss bugs, but the TLS, cookie, and key-validation items are security defects that should be fixed before any production deployment. +#### Re-review 2026-05-17 (commit `39d737e`) + +Re-reviewed the module after the batch-resolution of Security-001..007, 009, 010, 011. +Those fixes were verified in place and remain Resolved: the `LdapTransport` enum makes +StartTLS reachable, the auth cookie is `SecurePolicy.Always` with a sliding window, the +JWT signing key is length-validated at construction, issuer/audience are bound and +checked, the DN-injection fallback is RFC 4514-escaped, and the idle-timeout anchor is +carried across refreshes. Security-008 (N+1 scope-rule query) remains correctly +**Deferred** — `ISecurityRepository` still exposes no per-set scope-rule query, so the +N+1 cannot be removed from within `RoleMapper.cs` alone; the deferral condition is +unchanged. This pass surfaced **4 new findings** (Security-012..015): two Medium and +two Low. The most notable is that a *partial* LDAP outage during login (bind succeeds, +group search fails) silently returns an authenticated session with **zero roles** +instead of failing the login as the design's LDAP-failure rule requires +(Security-012); and that `RefreshToken` re-issues a token without ever checking +`IsIdleTimedOut`, so an idle-expired session can be renewed indefinitely if the caller +omits the separate idle check (Security-014). The two Low findings concern fragile +DN parsing of group names containing escaped commas and an un-trimmed username flowing +into the LDAP filter, fallback DN, and JWT claims. + ## Checklist coverage | # | Category | Examined | Notes | @@ -456,3 +476,141 @@ suite. Remaining gaps closed here: renamed the test file `UnitTest1.cs` → LDAP-timeout coverage (Security-009) plus extra no-service-account / DN-path edge cases (`BuildFallbackUserDn_NoSearchBase_ReturnsBareRdn`, `EscapeLdapDn_LeadingHash_IsEscaped`, `EscapeLdapDn_NullOrEmpty_ReturnedUnchanged`). Full module suite: 54 tests, all green. + +### Security-012 — Partial LDAP failure during login yields a roleless authenticated session + +| | | +|--|--| +| Severity | Medium | +| Category | Error handling & resilience | +| Status | Open | +| Location | `src/ScadaLink.Security/LdapAuthService.cs:78-118` | + +**Description** + +In `AuthenticateAsync`, after the user bind succeeds, the group/attribute `Search` is +wrapped in a `try`/`catch (LdapException)` that logs a warning and continues — the +comment states "Auth succeeded even if attribute lookup failed". The method then returns +`new LdapAuthResult(true, displayName, username, groups, null)` with an **empty +`groups`** list. Because `RoleMapper.MapGroupsToRolesAsync` derives all roles from that +group list, a login during a *partial* LDAP outage (bind reachable, search/group server +unavailable, or a transient `LdapException` mid-search) produces a fully authenticated +session with **zero roles** — the user is logged in but silently stripped of all +permissions. The design's LDAP Connection Failure rule states new logins must **fail** +when LDAP is unavailable; this path instead admits the user. The caller cannot +distinguish "user genuinely belongs to no mapped groups" from "the group query failed", +so it cannot make the documented fail-the-login decision. The `while` loop's inner +`catch (LdapException) { break; }` (line 107-111) compounds this: a real error mid-enumeration +is indistinguishable from end-of-results and silently truncates the group list. + +**Recommendation** + +Treat a failed group/attribute lookup on the *initial login* as an authentication +failure: return `Success = false` with a "directory unavailable, try again" message, or +add an explicit flag on `LdapAuthResult` (e.g. `GroupLookupSucceeded`) so the caller can +apply the design's fail-the-login rule. The inner `while`-loop `catch` should not treat +an arbitrary `LdapException` as a benign end-of-results sentinel — only the specific +"no more results" condition should break the loop; any other exception should propagate +or be surfaced. + +**Resolution** + +_Unresolved._ + +### Security-013 — `ExtractFirstRdnValue` mis-parses group DNs containing escaped commas + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.Security/LdapAuthService.cs:258-269` | + +**Description** + +`ExtractFirstRdnValue` extracts a group name from a `memberOf` DN by taking the substring +between the first `=` and the first `,`. RFC 4514 permits a `,` inside an RDN value when +it is backslash-escaped — e.g. `cn=Acme\, Inc Operators,ou=groups,dc=example,dc=com`. The +method splits on the *escaped* comma and returns `Acme\` as the group name. `RoleMapper` +then matches that mangled name verbatim against configured `LdapGroupMapping.LdapGroupName` +values, so a group whose CN legitimately contains a comma silently fails to map to its +role — the user loses that role with no error. The same naive parsing also does not strip +the trailing backslash escape. + +**Recommendation** + +Parse the first RDN with an escape-aware scan: ignore a `,` preceded by an unescaped +`\`, and unescape RFC 4514 sequences in the extracted value. Alternatively use the LDAP +library's DN-parsing API (`LdapDN` / RDN accessors) rather than hand-rolled `IndexOf`. + +**Resolution** + +_Unresolved._ + +### Security-014 — `RefreshToken` re-issues a token without checking the idle timeout + +| | | +|--|--| +| Severity | Medium | +| Category | Security | +| Status | Open | +| Location | `src/ScadaLink.Security/JwtTokenService.cs:156-169` | + +**Description** + +`RefreshToken` carries the existing `LastActivity` claim forward (correct, per the +Security-007 fix) and unconditionally issues a fresh token with a new 15-minute expiry. +It never calls `IsIdleTimedOut`. The documented 30-minute idle policy therefore depends +entirely on the CentralUI request pipeline calling `IsIdleTimedOut` *before* calling +`RefreshToken` — two independent, order-dependent checks with no enforcement that they +are used together. If a caller refreshes without first checking idle state (an easy +mistake, and there is no compiler or API signal preventing it), an idle-expired session +is silently renewed: each refresh resets the JWT expiry while `LastActivity` keeps +ageing, so the session can be kept alive indefinitely by background refreshes even +though the user has been idle for hours. `RefreshToken` is the natural place to enforce +the invariant but provides no safety net. + +**Recommendation** + +Have `RefreshToken` itself reject a principal that is already past the idle timeout — +return `null` (the same "cannot refresh" signal it already uses for missing claims) when +`IsIdleTimedOut(currentPrincipal)` is true — so the idle policy holds regardless of +caller discipline. Document that an idle-expired principal cannot be refreshed and add a +regression test covering refresh of a 31-minute-idle token. + +**Resolution** + +_Unresolved._ + +### Security-015 — Username is not trimmed before use in the LDAP filter, fallback DN, and JWT claims + +| | | +|--|--| +| Severity | Low | +| Category | Correctness & logic bugs | +| Status | Open | +| Location | `src/ScadaLink.Security/LdapAuthService.cs:20-21`, `:80`, `:122`, `:169`, `:193` | + +**Description** + +`AuthenticateAsync` rejects null/whitespace-only usernames via `IsNullOrWhiteSpace`, but +an otherwise-valid username with leading or trailing whitespace (`" alice"`, copy-paste +artefacts, mobile keyboards) is passed through verbatim. It flows into the LDAP search +filter `({attr}={EscapeLdapFilter(username)})`, into the fallback bind DN via +`BuildFallbackUserDn`, and — most consequentially — into the returned `LdapAuthResult` +and from there into the JWT `Username` claim. Most LDAP directories ignore surrounding +whitespace on a search term, so `" alice"` and `"alice"` may both authenticate but yield +JWT `Username` claims that differ by whitespace. Any downstream identity comparison +(audit-log `who`, site-scope lookups keyed by username, session de-duplication) then +sees two distinct identities for the same person. + +**Recommendation** + +Trim the username once at the top of `AuthenticateAsync` (after the `IsNullOrWhiteSpace` +guard) and use the trimmed value consistently for the filter, the DN, and the +`LdapAuthResult`. This normalises the identity that ends up in the JWT claim and audit +trail. + +**Resolution** + +_Unresolved._