docs(code-reviews): re-review batch 3 at 39d737e — Host, InboundAPI, ManagementService, NotificationService, Security
21 new findings: Host-012..015, InboundAPI-014..017, ManagementService-014..017, NotificationService-014..018, Security-012..015.
This commit is contained in:
@@ -5,10 +5,10 @@
|
|||||||
| Module | `src/ScadaLink.Host` |
|
| Module | `src/ScadaLink.Host` |
|
||||||
| Design doc | `docs/requirements/Component-Host.md` |
|
| Design doc | `docs/requirements/Component-Host.md` |
|
||||||
| Status | Reviewed |
|
| Status | Reviewed |
|
||||||
| Last reviewed | 2026-05-16 |
|
| Last reviewed | 2026-05-17 |
|
||||||
| Reviewer | claude-agent |
|
| Reviewer | claude-agent |
|
||||||
| Commit reviewed | `9c60592` |
|
| Commit reviewed | `39d737e` |
|
||||||
| Open findings | 0 |
|
| Open findings | 4 |
|
||||||
|
|
||||||
## Summary
|
## Summary
|
||||||
|
|
||||||
@@ -31,20 +31,37 @@ and unguarded string interpolation when building HOCON. None are crash/data-loss
|
|||||||
class, but the readiness bug is High because it breaks load-balancer behaviour with
|
class, but the readiness bug is High because it breaks load-balancer behaviour with
|
||||||
no safe workaround.
|
no safe workaround.
|
||||||
|
|
||||||
|
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||||
|
|
||||||
|
All eleven findings from the first review (Host-001..011) are confirmed `Resolved`
|
||||||
|
in the current tree — the `/health/ready` predicate, the externalised secrets, the
|
||||||
|
seed-node/GrpcPort validation rules, the escaped HOCON builder, the bounded
|
||||||
|
migration retry and the live `LoggingOptions.MinimumLevel` are all present as
|
||||||
|
described in their Resolution notes. This re-review walked all ten checklist
|
||||||
|
categories again over the full module and recorded four new findings, none of them
|
||||||
|
crash/data-loss class. Following up on a batch-1 ClusterInfrastructure re-review
|
||||||
|
note, Host-012 (Medium) confirms `AkkaHostedService.BuildHocon` hard-codes
|
||||||
|
`down-if-alone = on` and never reads the `ClusterOptions.DownIfAlone` property, so
|
||||||
|
that documented, bound option is dead. The remaining three are Low: `:F0` rounding
|
||||||
|
of cluster timing values silently degrades any sub-second configuration (Host-013),
|
||||||
|
Serilog sink setup is hard-coded in `Program.cs` rather than configuration-driven as
|
||||||
|
REQ-HOST-8 requires (Host-014), and `StartupRetry` retries indiscriminately on every
|
||||||
|
exception type including permanent schema-validation failures (Host-015).
|
||||||
|
|
||||||
## Checklist coverage
|
## Checklist coverage
|
||||||
|
|
||||||
| # | Category | Examined | Notes |
|
| # | Category | Examined | Notes |
|
||||||
|---|----------|----------|-------|
|
|---|----------|----------|-------|
|
||||||
| 1 | Correctness & logic bugs | ☑ | `/health/ready` includes the leader-only check (Host-001); site seed-node config points at the gRPC port (Host-004). |
|
| 1 | Correctness & logic bugs | ☑ | Host-001/004 resolved. Re-review: `:F0` rounding of cluster timing values silently distorts sub-second durations (Host-013). |
|
||||||
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. HOCON built by raw string interpolation (Host-006); `StartAsync` returns before actors are confirmed running (Host-009). |
|
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. Host-006/009 resolved; no new issues. |
|
||||||
| 3 | Concurrency & thread safety | ☑ | Blocking `GetAwaiter().GetResult()` on a hosted-service startup thread (Host-005). `DeadLetterMonitorActor` state is actor-confined — no issues. |
|
| 3 | Concurrency & thread safety | ☑ | Host-005 resolved (`StartAsync` now genuinely async). `DeadLetterMonitorActor` state is actor-confined — no issues. |
|
||||||
| 4 | Error handling & resilience | ☑ | Top-level try/catch logs fatal and rethrows. No retry around DB migration / readiness preconditions (Host-010). |
|
| 4 | Error handling & resilience | ☑ | Host-010 resolved. Re-review: `StartupRetry` retries indiscriminately on permanent failures (e.g. schema-validation mismatch) (Host-015). |
|
||||||
| 5 | Security | ☑ | Plaintext DB password, LDAP service-account password and dev JWT key checked into `appsettings.Central.json` (Host-003). |
|
| 5 | Security | ☑ | Host-003 resolved — secrets externalised to env vars. No new issues. |
|
||||||
| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. |
|
| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. No new issues. |
|
||||||
| 7 | Design-document adherence | ☑ | REQ-HOST-6 mandates Akka.Persistence config but none exists and no persistent actors exist — doc is stale (Host-002). REQ-HOST-4 GrpcPort-≠-RemotingPort rule not enforced (Host-007). |
|
| 7 | Design-document adherence | ☑ | Host-002/007 resolved. Re-review: `ClusterOptions.DownIfAlone` bound/documented but never consumed — HOCON hard-codes `on` (Host-012); Serilog sinks hard-coded, not config-driven per REQ-HOST-8 (Host-014). |
|
||||||
| 8 | Code organization & conventions | ☑ | `MachineDataDb` validated/declared but never consumed (Host-008). `LoggingOptions.MinimumLevel` is dead (Host-011). |
|
| 8 | Code organization & conventions | ☑ | Host-008/011 resolved. No new issues. |
|
||||||
| 9 | Testing coverage | ☑ | Strong suite; no test asserts `/health/ready` excludes `active-node`, which is why Host-001 slipped through (noted in Host-001). |
|
| 9 | Testing coverage | ☑ | Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new `down-if-alone`, sub-second-duration, or non-transient-retry paths (Host-012/013/015). |
|
||||||
| 10 | Documentation & comments | ☑ | Comments are accurate. REQ-HOST-6 in the design doc is the main stale-doc item (Host-002). |
|
| 10 | Documentation & comments | ☑ | REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014). |
|
||||||
|
|
||||||
## Findings
|
## Findings
|
||||||
|
|
||||||
@@ -561,3 +578,153 @@ Regression tests in new `LoggerConfigurationTests`:
|
|||||||
— they assert the configured level actually filters log events; they pass against the
|
— they assert the configured level actually filters log events; they pass against the
|
||||||
new factory (and would fail against the old inline configuration, which ignored the
|
new factory (and would fail against the old inline configuration, which ignored the
|
||||||
key). Full Host suite green (175 passed).
|
key). Full Host suite green (175 passed).
|
||||||
|
|
||||||
|
### Host-012 — `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Medium |
|
||||||
|
| Category | Design-document adherence |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:146-148` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`AkkaHostedService.BuildHocon` emits the split-brain-resolver block with
|
||||||
|
`keep-oldest { down-if-alone = on }` as a literal constant. `ClusterOptions`
|
||||||
|
(`src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:74`) exposes a
|
||||||
|
`DownIfAlone` property — bound from `ScadaLink:Cluster` via the Options pattern,
|
||||||
|
documented as "the design-doc requirement", default `true` — but a repo-wide search
|
||||||
|
shows it is referenced **nowhere outside its own declaration**. The Host therefore
|
||||||
|
ignores the bound value entirely: setting `ScadaLink:Cluster:DownIfAlone` to `false`
|
||||||
|
in `appsettings.json` has no effect, the resolver still runs with `down-if-alone =
|
||||||
|
on`. This is the same class of defect as the resolved Host-011
|
||||||
|
(`LoggingOptions.MinimumLevel` was dead config) — a configuration option that is
|
||||||
|
declared, bound, and documented but never consumed, which silently misleads any
|
||||||
|
operator who edits it. A batch-1 re-review of ClusterInfrastructure flagged the same
|
||||||
|
hard-coding; it is recorded here because `BuildHocon` is Host-owned code (the
|
||||||
|
ClusterInfrastructure project owns only the configuration contract, per the
|
||||||
|
`ClusterOptions` class comment).
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Make `BuildHocon` consume `clusterOptions.DownIfAlone` — emit `down-if-alone =
|
||||||
|
{(clusterOptions.DownIfAlone ? "on" : "off")}` (the value is a bool, so no escaping
|
||||||
|
is needed). Add a `HoconBuilderTests` case asserting both `true` and `false` produce
|
||||||
|
the corresponding `down-if-alone` token. If the flag is genuinely meant to be fixed
|
||||||
|
at `on` for ScadaLink's two-node clusters, remove the `DownIfAlone` property and its
|
||||||
|
doc comment instead so code and configuration contract agree — but do not leave it
|
||||||
|
declared-but-dead.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### Host-013 — `:F0` rounding of cluster timing values silently degrades sub-second configuration
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Low |
|
||||||
|
| Category | Correctness & logic bugs |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:135-136,145,151-152` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`BuildHocon` renders every duration into HOCON via `TotalSeconds:F0` —
|
||||||
|
`transportHeartbeatSec:F0`, `transportFailureSec:F0`, `StableAfter.TotalSeconds:F0`,
|
||||||
|
`HeartbeatInterval.TotalSeconds:F0`, `FailureDetectionThreshold.TotalSeconds:F0`. The
|
||||||
|
`F0` format specifier rounds to whole seconds, so any sub-second configuration is
|
||||||
|
silently distorted: a `HeartbeatInterval` of `00:00:00.5` renders as
|
||||||
|
`heartbeat-interval = 0s` (a degenerate / invalid value that Akka will reject or
|
||||||
|
treat as zero), and `00:00:02.6` becomes `3s`. The shipped defaults are all whole
|
||||||
|
seconds so this is latent, but the configuration model accepts arbitrary `TimeSpan`
|
||||||
|
values and an operator tuning a heartbeat to e.g. `750ms` would get a `1s` value with
|
||||||
|
no warning — or, worse, `0s`. Rounding configured timing values without surfacing
|
||||||
|
the change is a correctness hazard for exactly the kind of failure-detection tuning
|
||||||
|
these options exist for.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Render durations without precision loss — emit milliseconds (e.g.
|
||||||
|
`heartbeat-interval = {ts.TotalMilliseconds:F0}ms`) so sub-second values survive, or
|
||||||
|
validate in `StartupValidator` that each cluster timing value is a positive whole
|
||||||
|
number of seconds and fail fast otherwise. Either way, do not silently round.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### Host-014 — Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8)
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Low |
|
||||||
|
| Category | Design-document adherence |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.Host/Program.cs:43-48`, `src/ScadaLink.Host/appsettings.json:1-7` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
REQ-HOST-8 requires the Host to configure Serilog with "Configuration-driven sink
|
||||||
|
setup (console and file sinks at minimum)". `LoggerConfigurationFactory.Build` calls
|
||||||
|
`.ReadFrom.Configuration(configuration)`, which reads the standard `Serilog`
|
||||||
|
configuration section — but neither `appsettings.json` nor either role-specific file
|
||||||
|
contains a `Serilog` section (only a Microsoft `Logging` section with a `LogLevel`
|
||||||
|
map, which Serilog's `ReadFrom.Configuration` does not consume). The two sinks that
|
||||||
|
actually run are appended in `Program.cs` as hard-coded `.WriteTo.Console(...)` and
|
||||||
|
`.WriteTo.File("logs/scadalink-.log", rollingInterval: Day)` calls. As a result the
|
||||||
|
console output template, the file path, and the rolling interval cannot be changed
|
||||||
|
without recompiling — an operator cannot redirect logs, change the file location, or
|
||||||
|
add a sink via configuration, contrary to REQ-HOST-8. The `ReadFrom.Configuration`
|
||||||
|
call is effectively a no-op because the section it reads does not exist.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Move the console and file sink definitions into a `Serilog` section in
|
||||||
|
`appsettings.json` (with `WriteTo` entries and the output template / file path /
|
||||||
|
rolling interval as arguments) so `ReadFrom.Configuration` drives them, and drop the
|
||||||
|
hard-coded `.WriteTo` calls from `Program.cs`. Alternatively, update REQ-HOST-8 to
|
||||||
|
state the sinks are intentionally code-defined — but the design doc currently says
|
||||||
|
"configuration-driven", so code and doc must be reconciled.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### Host-015 — `StartupRetry` retries on every exception type, including permanent failures
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Low |
|
||||||
|
| Category | Error handling & resilience |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.Host/StartupRetry.cs:36-45` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`StartupRetry.ExecuteWithRetryAsync` catches `Exception` with only the guard
|
||||||
|
`when (attempt < maxAttempts)` — it retries on *any* exception type. Its sole caller
|
||||||
|
wraps `MigrationHelper.ApplyOrValidateMigrationsAsync` (`Program.cs:124-134`), which
|
||||||
|
on a Central node in production *validates* the schema version and throws if the
|
||||||
|
deployed schema does not match the expected migration set. A schema-version mismatch
|
||||||
|
is a permanent error — retrying it cannot succeed — yet `StartupRetry` will retry it
|
||||||
|
8 times with capped exponential backoff (2s, 4s, 8s, 16s, 30s, 30s, 30s ≈ 2 minutes)
|
||||||
|
before finally rethrowing, delaying the fatal-exit-and-alert by minutes for a fault
|
||||||
|
that is fatal on the first attempt. The retry helper is meant for *transient*
|
||||||
|
connection faults (the XML doc says exactly that: "the database may be briefly
|
||||||
|
unreachable"), but it cannot distinguish transient from permanent failures.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Restrict retries to transient faults — e.g. accept an `Func<Exception, bool>
|
||||||
|
isTransient` predicate and, for the migration call site, treat only
|
||||||
|
connection-class exceptions (`SqlException` with a connection/transport error
|
||||||
|
number, `TimeoutException`, socket errors) as retryable while letting
|
||||||
|
schema-validation / `InvalidOperationException` failures propagate immediately. Add
|
||||||
|
a `StartupRetryTests` case asserting a non-transient exception is rethrown after a
|
||||||
|
single attempt.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|||||||
@@ -5,10 +5,10 @@
|
|||||||
| Module | `src/ScadaLink.InboundAPI` |
|
| Module | `src/ScadaLink.InboundAPI` |
|
||||||
| Design doc | `docs/requirements/Component-InboundAPI.md` |
|
| Design doc | `docs/requirements/Component-InboundAPI.md` |
|
||||||
| Status | Reviewed |
|
| Status | Reviewed |
|
||||||
| Last reviewed | 2026-05-16 |
|
| Last reviewed | 2026-05-17 |
|
||||||
| Reviewer | claude-agent |
|
| Reviewer | claude-agent |
|
||||||
| Commit reviewed | `9c60592` |
|
| Commit reviewed | `39d737e` |
|
||||||
| Open findings | 0 |
|
| Open findings | 4 |
|
||||||
|
|
||||||
## Summary
|
## Summary
|
||||||
|
|
||||||
@@ -30,6 +30,25 @@ well but there is no coverage of the HTTP endpoint, concurrency, or recompilatio
|
|||||||
None of the findings are data-loss-class, but the concurrency and trust-model issues
|
None of the findings are data-loss-class, but the concurrency and trust-model issues
|
||||||
are High severity and should be addressed before production use.
|
are High severity and should be addressed before production use.
|
||||||
|
|
||||||
|
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||||
|
|
||||||
|
All 13 findings from the initial review remain `Resolved`; the module source under
|
||||||
|
`src/ScadaLink.InboundAPI` is unchanged since the last InboundAPI fix commit
|
||||||
|
(`8dd7412`), which precedes `39d737e`. This re-review re-walked all 10 checklist
|
||||||
|
categories against the resolved code and surfaced **4 new findings** — none touching
|
||||||
|
the previously-fixed concurrency/trust-model code, but all in areas the first pass
|
||||||
|
did not probe deeply: (1) the `ReturnDefinition` column is loaded onto `ApiMethod`
|
||||||
|
but is never consulted — script return values are serialized verbatim with no shaping
|
||||||
|
or validation against the declared return structure (InboundAPI-014); (2) the new
|
||||||
|
`ForbiddenApiChecker` is a purely textual syntax walker and can be bypassed by
|
||||||
|
reaching forbidden functionality through member access that never spells a forbidden
|
||||||
|
namespace, e.g. `typeof(x).Assembly.GetType("System.IO.File")` (InboundAPI-015);
|
||||||
|
(3) routed `Route.To().Call()` invocations are not bound by the method timeout unless
|
||||||
|
the script explicitly threads `Parameters`-side cancellation, contradicting the design
|
||||||
|
statement that the timeout covers routed calls (InboundAPI-016); and (4) `RouteHelper`
|
||||||
|
/ `RouteTarget` — the entire WP-4 cross-site routing surface — has no test coverage
|
||||||
|
(InboundAPI-017). New findings are one Medium-trio plus one Low; no Critical or High.
|
||||||
|
|
||||||
## Checklist coverage
|
## Checklist coverage
|
||||||
|
|
||||||
| # | Category | Examined | Notes |
|
| # | Category | Examined | Notes |
|
||||||
@@ -38,11 +57,11 @@ are High severity and should be addressed before production use.
|
|||||||
| 2 | Akka.NET conventions | ☑ | Module is ASP.NET-hosted, no actors of its own; routes to actors via `CommunicationService`. No correlation-ID issues — IDs are set in `RouteHelper`. |
|
| 2 | Akka.NET conventions | ☑ | Module is ASP.NET-hosted, no actors of its own; routes to actors via `CommunicationService`. No correlation-ID issues — IDs are set in `RouteHelper`. |
|
||||||
| 3 | Concurrency & thread safety | ☑ | Singleton `InboundScriptExecutor` mutates a non-thread-safe `Dictionary` from concurrent request threads — see InboundAPI-001/002. |
|
| 3 | Concurrency & thread safety | ☑ | Singleton `InboundScriptExecutor` mutates a non-thread-safe `Dictionary` from concurrent request threads — see InboundAPI-001/002. |
|
||||||
| 4 | Error handling & resilience | ☑ | Catch-all conflates client cancellation with timeout (InboundAPI-004); compilation-failure path repeats work on every request (InboundAPI-009). |
|
| 4 | Error handling & resilience | ☑ | Catch-all conflates client cancellation with timeout (InboundAPI-004); compilation-failure path repeats work on every request (InboundAPI-009). |
|
||||||
| 5 | Security | ☑ | Non-constant-time key comparison, no trust-model enforcement, no body-size limit, missing-method enumeration oracle — see InboundAPI-003/005/006/011. |
|
| 5 | Security | ☑ | Prior items resolved. Re-review: `ForbiddenApiChecker` is a textual deny-list bypassable via reflection without a forbidden namespace token (InboundAPI-015). |
|
||||||
| 6 | Performance & resource management | ☑ | Up to 3 separate DB round-trips per request in `ApiKeyValidator`; uncapped lazy recompilation. |
|
| 6 | Performance & resource management | ☑ | Up to 3 separate DB round-trips per request in `ApiKeyValidator`; uncapped lazy recompilation. |
|
||||||
| 7 | Design-document adherence | ☑ | `Database.Connection()` script API missing; central-only hosting not enforced; lazy-compile diverges from "compiled at startup". |
|
| 7 | Design-document adherence | ☑ | Re-review: `ReturnDefinition` loaded but never used (InboundAPI-014); routed-call timeout not enforced (InboundAPI-016). Prior `Database.Connection()`/central-only items resolved. |
|
||||||
| 8 | Code organization & conventions | ☑ | `ParameterDefinition` is an API-shaped POCO declared in the component project rather than Commons; otherwise conventions followed. |
|
| 8 | Code organization & conventions | ☑ | `ParameterDefinition` moved to Commons (InboundAPI-012 resolved); no new issues. |
|
||||||
| 9 | Testing coverage | ☑ | Good unit coverage of the two validators; no endpoint, concurrency, recompilation, or timeout-vs-cancel tests. |
|
| 9 | Testing coverage | ☑ | Re-review: `RouteHelper`/`RouteTarget` (WP-4 routing) entirely untested (InboundAPI-017); validators/executor/filter well covered. |
|
||||||
| 10 | Documentation & comments | ☑ | `ApiKeyValidationResult.NotFound` XML/name says "NotFound" but returns HTTP 400 — misleading (InboundAPI-013). |
|
| 10 | Documentation & comments | ☑ | `ApiKeyValidationResult.NotFound` XML/name says "NotFound" but returns HTTP 400 — misleading (InboundAPI-013). |
|
||||||
|
|
||||||
## Findings
|
## Findings
|
||||||
@@ -580,3 +599,181 @@ the new method-not-found status, and removing dead code cannot regress. Doc-owne
|
|||||||
follow-up: `Component-InboundAPI.md`'s Error Handling section still does not list a
|
follow-up: `Component-InboundAPI.md`'s Error Handling section still does not list a
|
||||||
"method not found" status; it should note that it is reported as 403 (indistinguishable
|
"method not found" status; it should note that it is reported as 403 (indistinguishable
|
||||||
from "key not approved"), but that doc edit is outside this module's editable scope.
|
from "key not approved"), but that doc edit is outside this module's editable scope.
|
||||||
|
|
||||||
|
### InboundAPI-014 — `ReturnDefinition` is loaded but never used; script return value is unshaped/unvalidated
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Medium |
|
||||||
|
| Category | Design-document adherence |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:201-205`, `src/ScadaLink.Commons/Entities/InboundApi/ApiMethod.cs:10` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`Component-InboundAPI.md` ("API Method Definition → Return Value Definition" and the
|
||||||
|
"Response Format" section) specifies that each method has a declared return structure
|
||||||
|
— "Field names and data types … Supports returning lists of objects" — and that the
|
||||||
|
success response body is "the method's return value as JSON, with fields matching the
|
||||||
|
return value definition". The `ApiMethod` entity carries a `ReturnDefinition` column
|
||||||
|
to hold exactly this. However, nothing in the module ever reads `ReturnDefinition`:
|
||||||
|
`ExecuteAsync` takes whatever object the script happens to return and does a blind
|
||||||
|
`JsonSerializer.Serialize(result)`. There is no validation that the script's return
|
||||||
|
value matches the declared shape, no coercion to the declared field types, and no
|
||||||
|
error when a method returns a structure inconsistent with its definition. A method
|
||||||
|
whose script returns the wrong shape (or `null` where a structure is required) will
|
||||||
|
silently emit a malformed 200 response, and the documented return-definition contract
|
||||||
|
is effectively unenforced. This is the response-side mirror of the parameter
|
||||||
|
validation that `ParameterValidator` does perform, leaving the two halves of the
|
||||||
|
method contract asymmetric.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Either (a) implement return-value validation/shaping: parse `ReturnDefinition` with
|
||||||
|
the same extended-type machinery used for parameters and validate/coerce the script
|
||||||
|
result before serializing, returning a 500 (or logging) when the script result does
|
||||||
|
not match; or (b) if return shaping is deliberately out of scope, remove the "Return
|
||||||
|
Value Definition" / "fields matching the return value definition" language from
|
||||||
|
`Component-InboundAPI.md` and document that the response is the script's raw return
|
||||||
|
value serialized as-is. Code and design doc must be reconciled.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### InboundAPI-015 — `ForbiddenApiChecker` is purely textual and is bypassable via reflection reachable without a forbidden namespace token
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Medium |
|
||||||
|
| Category | Security |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.InboundAPI/ForbiddenApiChecker.cs:63-119`, `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:109-126` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`ForbiddenApiChecker` walks the script syntax tree and rejects any `using` directive,
|
||||||
|
`QualifiedNameSyntax`, or `MemberAccessExpressionSyntax` whose textual dotted name
|
||||||
|
starts with a forbidden namespace prefix (`System.IO`, `System.Diagnostics`,
|
||||||
|
`System.Reflection`, `System.Net`, etc.). This is a textual match, not a semantic
|
||||||
|
one, and the trust model it enforces (per InboundAPI-005) is explicitly meant to keep
|
||||||
|
*untrusted* Design-role scripts away from host APIs. The check can be bypassed because
|
||||||
|
forbidden functionality is reachable through member access that never spells a
|
||||||
|
forbidden namespace:
|
||||||
|
|
||||||
|
- `typeof(string).Assembly.GetType("System.IO.File")` — `typeof(string)` is permitted,
|
||||||
|
`.Assembly` is a `System.Type` property, `.GetType(string)` is a `System.Reflection.Assembly`
|
||||||
|
method. The string literal `"System.IO.File"` is a string, not a `QualifiedNameSyntax`
|
||||||
|
or `MemberAccessExpressionSyntax`, so `IsForbidden` never sees it. The script obtains
|
||||||
|
a `System.IO.File` `Type` and can `InvokeMember`/`GetMethod(...).Invoke(...)` on it —
|
||||||
|
all via members of permitted types — with no forbidden namespace ever appearing in
|
||||||
|
the source. `CompileAndRegister` references `typeof(object).Assembly`
|
||||||
|
(System.Private.CoreLib) in `ScriptOptions`, so every framework type is loadable at
|
||||||
|
runtime.
|
||||||
|
- The executor also references the `Microsoft.CSharp.RuntimeBinder` assembly
|
||||||
|
(`InboundScriptExecutor.cs:116`), enabling the `dynamic` keyword, which further
|
||||||
|
widens late-bound member access that the static walker cannot see through.
|
||||||
|
|
||||||
|
Because the inbound API script runs on the central node with the host process's
|
||||||
|
privileges and is authored by the (less-trusted-than-Admin) Design role, a static
|
||||||
|
textual deny-list gives a false sense of containment.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Treat the syntax walker as defence-in-depth, not the boundary. Strengthen it where
|
||||||
|
cheap (flag `Assembly.GetType`, `Type.GetType`, `Activator.CreateInstance`,
|
||||||
|
`InvokeMember`, and `dynamic` usage), but for real enforcement run compiled scripts
|
||||||
|
under a genuine boundary — a restricted `AssemblyLoadContext`/AppDomain-equivalent, a
|
||||||
|
curated reference set that does not expose reflection-to-arbitrary-type, or an
|
||||||
|
out-of-process sandbox — consistent with however the Site Runtime ultimately enforces
|
||||||
|
its instance-script trust model. At minimum, document in `Component-InboundAPI.md`
|
||||||
|
that the current check is best-effort and does not stop a determined script.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### InboundAPI-016 — Routed `Route.To().Call()` invocations are not bound by the method timeout
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Medium |
|
||||||
|
| Category | Design-document adherence |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.InboundAPI/RouteHelper.cs:59-152`, `src/ScadaLink.InboundAPI/InboundScriptExecutor.cs:177`, `:199` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`Component-InboundAPI.md` states the per-method timeout "defines the maximum time the
|
||||||
|
method is allowed to execute (**including any routed calls to sites**)", and the
|
||||||
|
Routing Behavior section says a routed call "blocks until the site responds or the
|
||||||
|
**method-level timeout** is reached". The executor builds a linked
|
||||||
|
`CancellationTokenSource` (`cts`) combining the request-abort token and a dedicated
|
||||||
|
timeout CTS, and exposes `cts.Token` to the script as `InboundScriptContext.CancellationToken`.
|
||||||
|
However, every `RouteTarget` method (`Call`, `GetAttribute(s)`, `SetAttribute(s)`)
|
||||||
|
takes `CancellationToken cancellationToken = default` and the script must *explicitly*
|
||||||
|
pass the context token for the routed call to honour the timeout. A natural script —
|
||||||
|
`Route.To("inst").Call("doWork", parameters)` — invokes the routed call with
|
||||||
|
`CancellationToken.None`. That request flows into `CommunicationService.RouteToCallAsync`
|
||||||
|
with no cancellation, so the routed call is not bounded by the method timeout at all.
|
||||||
|
The only timeout guard left is `handler(context).WaitAsync(cts.Token)` in
|
||||||
|
`ExecuteAsync`: when the method timeout fires, `WaitAsync` returns a cancellation to
|
||||||
|
the caller, but the underlying script `Task` — and the in-flight `RouteToCallAsync`
|
||||||
|
awaiting a remote site — keeps running orphaned with no cancellation, holding the
|
||||||
|
correlation/communication resources until the site eventually responds or its own
|
||||||
|
transport timeout (if any) fires. The design's guarantee that the method timeout
|
||||||
|
covers routed calls is therefore not met, and a slow/hung site can leak background
|
||||||
|
work past the timeout the caller was told bounds the request.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Make routed calls inherit the method deadline without relying on script discipline:
|
||||||
|
have `RouteHelper`/`RouteTarget` carry the executing method's `CancellationToken`
|
||||||
|
(injected by `InboundScriptExecutor` when it constructs the context, e.g. a
|
||||||
|
`RouteHelper` bound to `cts.Token`) and pass it into every `CommunicationService`
|
||||||
|
call by default, so `Route.To("x").Call("s", p)` is timeout-bounded with no token
|
||||||
|
argument. Keep the explicit-token overload for callers that want a tighter bound.
|
||||||
|
Verify `RouteToCallAsync` and the attribute-routing calls actually observe the token
|
||||||
|
and abandon the in-flight request when it fires.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### InboundAPI-017 — `RouteHelper` / `RouteTarget` has no test coverage
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Low |
|
||||||
|
| Category | Testing coverage |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.InboundAPI/RouteHelper.cs:1-165`, `tests/ScadaLink.InboundAPI.Tests/` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`RouteHelper`/`RouteTarget` is the entire WP-4 cross-site routing surface — the
|
||||||
|
`Route.To().Call()/GetAttribute(s)/SetAttribute(s)` API that inbound API scripts use
|
||||||
|
to reach instances at any site. It has zero tests: the `ScadaLink.InboundAPI.Tests`
|
||||||
|
project covers `ApiKeyValidator`, `ParameterValidator`, `InboundScriptExecutor`, and
|
||||||
|
`InboundApiEndpointFilter`, but no test file exercises `RouteHelper`. Untested
|
||||||
|
behaviours include site resolution via `IInstanceLocator` (including the
|
||||||
|
"instance not found / no assigned site" `InvalidOperationException` path at
|
||||||
|
`RouteHelper.cs:154-164`), the `!response.Success` → `InvalidOperationException`
|
||||||
|
translation in each routed method, `GetAttribute` delegating to the batch
|
||||||
|
`GetAttributes` and returning `null` for an absent key, correlation-ID generation,
|
||||||
|
and `SetAttribute` delegating to `SetAttributes`. These are non-trivial branches
|
||||||
|
whose failure modes (a thrown exception inside a script) surface to the caller as a
|
||||||
|
500, so regressions would be silent.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Add a `RouteHelperTests` suite using substituted `IInstanceLocator` and
|
||||||
|
`CommunicationService` (the executor tests already substitute `CommunicationService`):
|
||||||
|
cover the happy path of each routed method, the unresolved-instance throw, the
|
||||||
|
`!Success` → `InvalidOperationException` mapping, and `GetAttribute` returning `null`
|
||||||
|
for a missing key. This also gives InboundAPI-016 a regression home if the timeout
|
||||||
|
wiring is added.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|||||||
@@ -5,10 +5,10 @@
|
|||||||
| Module | `src/ScadaLink.ManagementService` |
|
| Module | `src/ScadaLink.ManagementService` |
|
||||||
| Design doc | `docs/requirements/Component-ManagementService.md` |
|
| Design doc | `docs/requirements/Component-ManagementService.md` |
|
||||||
| Status | Reviewed |
|
| Status | Reviewed |
|
||||||
| Last reviewed | 2026-05-16 |
|
| Last reviewed | 2026-05-17 |
|
||||||
| Reviewer | claude-agent |
|
| Reviewer | claude-agent |
|
||||||
| Commit reviewed | `9c60592` |
|
| Commit reviewed | `39d737e` |
|
||||||
| Open findings | 0 (1 Deferred — see ManagementService-012) |
|
| Open findings | 4 (1 Deferred — see ManagementService-012) |
|
||||||
|
|
||||||
## Summary
|
## Summary
|
||||||
|
|
||||||
@@ -28,6 +28,24 @@ instances never disposed) and dead/unused configuration. None of the findings ar
|
|||||||
crash-class, but the site-scope gaps are High severity because they are a real
|
crash-class, but the site-scope gaps are High severity because they are a real
|
||||||
authorization bypass with no workaround.
|
authorization bypass with no workaround.
|
||||||
|
|
||||||
|
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||||
|
|
||||||
|
All thirteen prior findings remain correctly closed; the source under
|
||||||
|
`src/ScadaLink.ManagementService` is byte-identical between the previously reviewed state
|
||||||
|
and `39d737e` (the resolution commits of findings 001–013 are folded into the history at or
|
||||||
|
before `39d737e`). ManagementService-012 was re-checked and its **Deferred** status still
|
||||||
|
holds: `ManagementEnvelope.Command` is still typed `object`, and the marker-interface fix
|
||||||
|
still belongs in the Commons module, outside this module's edit scope — nothing has changed
|
||||||
|
to make it actionable here. This re-review re-ran the full 10-category checklist against the
|
||||||
|
current sources and surfaced **four new findings**. The dominant theme is the same
|
||||||
|
site-scope authorization gap that findings 001/002 closed: `HandleQueryDeployments`
|
||||||
|
(`QueryDeploymentsCommand`) was overlooked by that sweep and still performs no site-scope
|
||||||
|
enforcement, letting a site-scoped Deployment user read deployment history for any site
|
||||||
|
(014, High). The remaining three are lower severity: a non-atomic multi-override mutation
|
||||||
|
that can leave an instance partially modified after an error (015, Medium), raw exception
|
||||||
|
messages from unexpected faults being returned verbatim to HTTP callers (016, Low), and
|
||||||
|
`QueryDeploymentsCommand` having no test coverage at all (017, Low).
|
||||||
|
|
||||||
## Checklist coverage
|
## Checklist coverage
|
||||||
|
|
||||||
| # | Category | Examined | Notes |
|
| # | Category | Examined | Notes |
|
||||||
@@ -551,3 +569,144 @@ infrastructure, so the testable command-parsing/dispatch logic was extracted int
|
|||||||
`ParseCommand` helper and covered instead. Tests: `ParseCommand_*` (5),
|
`ParseCommand` helper and covered instead. Tests: `ParseCommand_*` (5),
|
||||||
`SerializeResult_*` (2), `UnknownCommandType_FaultMappedToManagementError`, plus the
|
`SerializeResult_*` (2), `UnknownCommandType_FaultMappedToManagementError`, plus the
|
||||||
pre-existing site-scope and DebugStreamHub suites. `dotnet test` -> 48 passed.
|
pre-existing site-scope and DebugStreamHub suites. `dotnet test` -> 48 passed.
|
||||||
|
|
||||||
|
### ManagementService-014 — HandleQueryDeployments bypasses site-scope enforcement
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | High |
|
||||||
|
| Category | Security |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:306`, `:1174` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`QueryDeploymentsCommand` is gated to the `Deployment` role by `GetRequiredRole`
|
||||||
|
(`ManagementActor.cs:170`–`:177`), and the design document's Authorization section states
|
||||||
|
"Site scoping is enforced for site-scoped Deployment users" and explicitly lists deployments
|
||||||
|
among the Deployment-role operations. `HandleQueryDeployments` makes no call to
|
||||||
|
`EnforceSiteScope` / `EnforceSiteScopeForInstance` / `EnforceSiteScopeForIdentifier`: with no
|
||||||
|
`InstanceId` it returns `IDeploymentManagerRepository.GetAllDeploymentRecordsAsync()` (every
|
||||||
|
deployment record across all sites), and with an `InstanceId` it returns that instance's
|
||||||
|
deployment history with no check that the instance's site is within the caller's permitted
|
||||||
|
set. A site-scoped Deployment user scoped to site A can therefore enumerate deployment
|
||||||
|
records for instances at site B — instance IDs, `DeployedBy` (operator usernames),
|
||||||
|
timestamps, deployment status, and `ErrorMessage` content — by issuing `QueryDeployments`
|
||||||
|
with or without an out-of-scope `InstanceId`. This is the same authorization-bypass class as
|
||||||
|
the resolved findings 001/002, on a handler that sweep did not cover; it is `DispatchCommand`'s
|
||||||
|
only `Deployment`-role handler with no scope enforcement.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Thread `AuthenticatedUser` into `HandleQueryDeployments` (the dispatch case at line 306
|
||||||
|
already has `user` in scope). When `cmd.InstanceId` is supplied, call
|
||||||
|
`EnforceSiteScopeForInstance` before querying. When it is not supplied, filter the returned
|
||||||
|
`DeploymentRecord` list to the caller's permitted sites — resolve each record's instance to
|
||||||
|
its `SiteId` (or join through a site-aware repository query) and drop records for sites
|
||||||
|
outside `PermittedSiteIds`, mirroring the `HandleListInstances` / `HandleListSites` filter
|
||||||
|
pattern. Add a regression test for a site-scoped user against in-scope and out-of-scope
|
||||||
|
instances.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### ManagementService-015 — HandleSetInstanceOverrides applies overrides non-atomically
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Medium |
|
||||||
|
| Category | Error handling & resilience |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:647`–`:659` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`HandleSetInstanceOverrides` iterates `cmd.Overrides` and calls
|
||||||
|
`InstanceService.SetAttributeOverrideAsync` once per attribute, throwing
|
||||||
|
`InvalidOperationException` on the first `result.IsSuccess == false`. Each
|
||||||
|
`SetAttributeOverrideAsync` call persists independently, so if the command supplies five
|
||||||
|
overrides and the third fails (e.g. an unknown attribute name or a validation error), the
|
||||||
|
first two overrides are already committed to the configuration database while the caller
|
||||||
|
receives a `ManagementError`. The instance is left partially mutated in a state the operator
|
||||||
|
neither sees nor requested, and the per-instance operation lock referenced in the design's
|
||||||
|
deployment decisions does not protect against this because the partial writes are committed
|
||||||
|
before the throw. A retry of the same command then re-applies the already-applied overrides.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Make the multi-override mutation all-or-nothing: either validate every requested override up
|
||||||
|
front before applying any, or apply all overrides within a single transaction / unit-of-work
|
||||||
|
so a mid-batch failure rolls back the earlier writes. If `InstanceService` cannot offer a
|
||||||
|
batch method, at minimum document the partial-application behaviour on `SetInstanceOverridesCommand`
|
||||||
|
and have the handler report which overrides were applied before the failure so the caller
|
||||||
|
can reconcile.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### ManagementService-016 — Unexpected exception messages returned verbatim to HTTP callers
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Low |
|
||||||
|
| Category | Security |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.ManagementService/ManagementActor.cs:121`–`:131` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`MapFault` maps any non-`SiteScopeViolationException` fault to
|
||||||
|
`new ManagementError(correlationId, cause.Message, "COMMAND_FAILED")`, and
|
||||||
|
`ManagementEndpoints.HandleRequest` returns that `Error` string directly in the HTTP 400
|
||||||
|
body. For handler-thrown `InvalidOperationException`s carrying a curated `result.Error`
|
||||||
|
message this is intended and safe. But the same path also surfaces the raw `.Message` of
|
||||||
|
unanticipated exceptions — a `SqlException` (which can include server/database names and
|
||||||
|
constraint details), a `DbUpdateException`, an `ArgumentException` from `Enum.Parse` on a
|
||||||
|
malformed `DataType`/`TriggerType` value, or a `NullReferenceException` — straight to the
|
||||||
|
external CLI/HTTP client. This is a minor internal-detail disclosure surface: the exception
|
||||||
|
text is already logged server-side with full context, so the client does not need the raw
|
||||||
|
message.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Distinguish handler-curated failures from unexpected faults. Have handlers throw a dedicated
|
||||||
|
exception type (e.g. `ManagementCommandException`) for messages that are safe to surface, and
|
||||||
|
in `MapFault` return that message for the known type while returning a generic
|
||||||
|
"An internal error occurred (CorrelationId=...)" string for everything else — the operator
|
||||||
|
can still correlate to the server log via the correlation ID.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### ManagementService-017 — QueryDeploymentsCommand has no test coverage
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Low |
|
||||||
|
| Category | Testing coverage |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `tests/ScadaLink.ManagementService.Tests/ManagementActorTests.cs:1` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`QueryDeploymentsCommand` / `HandleQueryDeployments` is exercised by no test in
|
||||||
|
`ManagementActorTests`. There is no test that it requires the `Deployment` role, no test of
|
||||||
|
the `InstanceId`-filtered versus unfiltered branches, and — because the handler performs no
|
||||||
|
site-scope enforcement at all — no test that would have caught finding 014. The deployment
|
||||||
|
query is one of the operations the design's Authorization section calls out for site
|
||||||
|
scoping, yet it is the only `Deployment`-role command with neither an authorization test nor
|
||||||
|
a site-scope test.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Add tests for `QueryDeploymentsCommand`: a role test (Design/no-role caller →
|
||||||
|
`ManagementUnauthorized`), branch coverage for the `InstanceId`-filtered and unfiltered
|
||||||
|
repository calls, and — once finding 014 is fixed — site-scope tests for a site-scoped
|
||||||
|
Deployment user against in-scope and out-of-scope deployment records.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|||||||
@@ -5,10 +5,10 @@
|
|||||||
| Module | `src/ScadaLink.NotificationService` |
|
| Module | `src/ScadaLink.NotificationService` |
|
||||||
| Design doc | `docs/requirements/Component-NotificationService.md` |
|
| Design doc | `docs/requirements/Component-NotificationService.md` |
|
||||||
| Status | Reviewed |
|
| Status | Reviewed |
|
||||||
| Last reviewed | 2026-05-16 |
|
| Last reviewed | 2026-05-17 |
|
||||||
| Reviewer | claude-agent |
|
| Reviewer | claude-agent |
|
||||||
| Commit reviewed | `9c60592` |
|
| Commit reviewed | `39d737e` |
|
||||||
| Open findings | 0 |
|
| Open findings | 5 |
|
||||||
|
|
||||||
## Summary
|
## Summary
|
||||||
|
|
||||||
@@ -30,6 +30,31 @@ plaintext `string` values. Test coverage exercises the happy path and the main
|
|||||||
error branches but misses the OAuth2 delivery path, the permanent-classification
|
error branches but misses the OAuth2 delivery path, the permanent-classification
|
||||||
fallback in `DeliverAsync`, and concurrency on the token cache.
|
fallback in `DeliverAsync`, and concurrency on the token cache.
|
||||||
|
|
||||||
|
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||||
|
|
||||||
|
Re-reviewed at commit `39d737e` after the NS-001..013 fixes. All twelve prior
|
||||||
|
findings remain closed (eleven Resolved, NS-013 Deferred) and the fixes hold up
|
||||||
|
against the current source — error classification is now typed, the SMTP client is
|
||||||
|
created once and always disconnected/disposed, the S&F `Notification` delivery
|
||||||
|
handler is registered in `AkkaHostedService`, and the OAuth2 token cache is keyed
|
||||||
|
per credential. The re-review surfaced **five new findings**. The recurring theme is
|
||||||
|
**unclassified-exception handling**: the typed `ClassifySmtpError` helper introduced
|
||||||
|
by NS-002/003 returns `Unknown` for anything that is not a recognised SMTP/socket
|
||||||
|
failure, but neither `SendAsync` nor `DeliverBufferedAsync` has a catch-all for the
|
||||||
|
`Unknown` bucket. As a result an OAuth2 token-fetch failure (`HttpRequestException`
|
||||||
|
from `EnsureSuccessStatusCode`, or `InvalidOperationException` from a malformed
|
||||||
|
credential string) escapes `SendAsync` as a raw exception to the calling script
|
||||||
|
(NS-015) and escapes `DeliverBufferedAsync` out of the S&F handler, where the engine
|
||||||
|
treats any thrown exception as transient and retries a permanently-broken config
|
||||||
|
forever (NS-014) — the same defect class NS-008 fixed for address parsing, but the
|
||||||
|
OAuth2 path was never covered. Secondary findings: `AuthenticateAsync` silently
|
||||||
|
proceeds unauthenticated for an unknown `authType` or empty credentials (NS-016);
|
||||||
|
`NotificationOptions` is bound from configuration in two places but never read by
|
||||||
|
any code (NS-017, dead config — NS-007 sourced the timeout/limit from
|
||||||
|
`SmtpConfiguration` instead); and the lazily-created concurrency limiter is read
|
||||||
|
outside its lock, is sized once and never resized on redeployment, and is never
|
||||||
|
disposed (NS-018).
|
||||||
|
|
||||||
## Checklist coverage
|
## Checklist coverage
|
||||||
|
|
||||||
| # | Category | Examined | Notes |
|
| # | Category | Examined | Notes |
|
||||||
@@ -465,3 +490,108 @@ filed) and the genuinely-missing tests were added:
|
|||||||
`AkkaHostedService` handler registration), so no further test is required here.
|
`AkkaHostedService` handler registration), so no further test is required here.
|
||||||
|
|
||||||
Module test suite is green at 56 tests.
|
Module test suite is green at 56 tests.
|
||||||
|
|
||||||
|
### NotificationService-014 — OAuth2 token-fetch failure escapes `DeliverBufferedAsync`; a permanently-broken config is retried forever
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | High |
|
||||||
|
| Category | Error handling & resilience |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:214-228`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:308-312`, `src/ScadaLink.NotificationService/OAuth2TokenService.cs:56-84` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`DeliverBufferedAsync` is the registered Store-and-Forward delivery handler for `StoreAndForwardCategory.Notification`. Per the S&F handler contract (`StoreAndForwardService.cs:40-53`), a handler must return `true` for success, `false` for permanent failure (park, no further retries), and **throw only for a transient failure** (retry). `DeliverBufferedAsync`'s `try` block (`:214-228`) catches only `SmtpPermanentException`; the comment at `:227` asserts "Transient SMTP errors propagate out of `DeliverAsync`". But `DeliverAsync` can also throw exceptions that `ClassifySmtpError` buckets as `Unknown` — most importantly the OAuth2 path at `:308-312`, where `OAuth2TokenService.GetTokenAsync` throws `HttpRequestException` (from `EnsureSuccessStatusCode` at `OAuth2TokenService.cs:78`) or `InvalidOperationException` (malformed credential triple, or missing `access_token`). None of these is an `SmtpCommandException`/`SmtpProtocolException`/`SocketException`/`TimeoutException`, so they are neither caught nor classified — they propagate straight out of `DeliverBufferedAsync`. The S&F engine treats *any* thrown exception as a transient failure, so a notification whose SMTP config has an invalid client secret (a 401 → `HttpRequestException`, a genuinely **permanent** condition) is retried on every sweep until `MaxRetries` is reached — burning OAuth2 token-endpoint calls and never parking promptly. A malformed credential string (`InvalidOperationException`, never fixable) is likewise retried instead of parked immediately.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Add a catch-all to `DeliverBufferedAsync` for exceptions that `ClassifySmtpError` classifies as `Unknown`: log and return `false` (park) for non-retryable causes such as `InvalidOperationException` from credential parsing, and decide deliberately whether an OAuth2 `HttpRequestException` should park or retry (a 5xx token endpoint is transient; a 401 is permanent — inspect `HttpRequestException.StatusCode`). Do not let an unclassified exception leave the handler and be silently reinterpreted as transient.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### NotificationService-015 — Unclassified exceptions (OAuth2 token fetch, non-cancellation OCE) escape `SendAsync` to the calling script
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | High |
|
||||||
|
| Category | Error handling & resilience |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:96-148`, `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:308-312` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`SendAsync`'s `try`/`catch` around `DeliverAsync` has exactly three catch clauses: `SmtpPermanentException` (`:101`), `OperationCanceledException` when cancellation was requested (`:111`), and `Exception` filtered by `IsTransientSmtpError` (`:116`). There is no catch-all for an exception that is none of those. `ClassifySmtpError` returns `SmtpErrorClass.Unknown` for anything it does not recognise, and `IsTransientSmtpError` only matches `Transient`, so an `Unknown`-classified exception falls through all three clauses and escapes `SendAsync` unhandled. Concretely, the OAuth2 branch at `:308-312` calls `OAuth2TokenService.GetTokenAsync`, which throws `HttpRequestException` (token endpoint unreachable / 4xx / 5xx) or `InvalidOperationException` (malformed `tenant:client:secret` triple, or no `access_token` in the response). Both reach the calling script (an instance/alarm/shared script) as a raw exception of a type the `INotificationDeliveryService` contract never advertises — the design doc says delivery returns either success, a buffered transient result, or a permanent `NotificationResult` error. `Notify.To().Send()` against an OAuth2 config with a bad secret therefore crashes the script instead of returning a clean `NotificationResult(false, ...)`. This is the same defect class NS-008 resolved for malformed addresses, but NS-008's `ValidateAddresses` fix never covered the OAuth2 token path. (A non-cancellation `OperationCanceledException` — e.g. an internal MailKit timeout surfaced as OCE while the caller's token is *not* cancelled — would escape the same way.)
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Add a final `catch (Exception ex)` to `SendAsync` that converts any otherwise-unhandled exception into a permanent `NotificationResult(false, ...)` (credential-scrubbed, consistent with NS-009), and log it. Treating an unclassified failure as permanent is the safe default — it returns control to the script rather than crashing it; an OAuth2 transient (token endpoint 5xx) being reported as permanent is an acceptable trade against an unhandled crash, and can be refined by inspecting `HttpRequestException.StatusCode` if transient buffering of token failures is wanted.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### NotificationService-016 — `AuthenticateAsync` silently sends unauthenticated for an unknown auth type or empty credentials
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Medium |
|
||||||
|
| Category | Security |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:46-67` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`MailKitSmtpClientWrapper.AuthenticateAsync` returns without authenticating in two silent paths. (1) `if (string.IsNullOrEmpty(credentials)) return;` (`:48-49`) — if an SMTP config has `AuthType` "basic" or "oauth2" but a null/empty `Credentials` value (a misconfigured or partially-deployed row), the method returns and `DeliverAsync` proceeds straight to `SendAsync` on an *unauthenticated* connection. (2) The `switch (authType.ToLowerInvariant())` (`:51-66`) has cases only for `"basic"` and `"oauth2"` and **no `default`** — any other `AuthType` value (a typo, a future "ntlm", an empty string) falls through with no authentication attempted. Additionally, in the `"basic"` case, `credentials.Split(':', 2)` that yields fewer than 2 parts (a credential string with no colon) is silently skipped (`:55-58`). In every case the connection then attempts to send mail unauthenticated: against a relay that does not require auth this masks a misconfiguration; against a server that does require auth it fails later with a less obvious error, and at worst an unauthenticated send succeeds where it should not. Authentication being skipped should never be silent.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Make missing/unparseable credentials and an unrecognised `AuthType` hard errors: throw a typed exception (e.g. a permanent configuration exception so `SendAsync`/`DeliverBufferedAsync` surface it as a clean failure / park) rather than returning. Add a `default:` arm to the `switch` that throws `ArgumentException` naming the unsupported auth type, and reject a "basic" credential string that does not split into exactly two parts.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### NotificationService-017 — `NotificationOptions` is bound from configuration but never read (dead config)
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Low |
|
||||||
|
| Category | Code organization & conventions |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.NotificationService/NotificationOptions.cs:1-15`, `src/ScadaLink.NotificationService/ServiceCollectionExtensions.cs:10-11`, `src/ScadaLink.Host/SiteServiceRegistration.cs:70` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`NotificationOptions` (with `ConnectionTimeoutSeconds` and `MaxConcurrentConnections`) is bound from the `ScadaLink:Notification` configuration section in two places — `ServiceCollectionExtensions.AddNotificationService` (`AddOptions<NotificationOptions>().BindConfiguration(...)`) and again in `Host/SiteServiceRegistration.cs:70` (`services.Configure<NotificationOptions>(...)`). However, a repo-wide search shows no code ever injects `IOptions<NotificationOptions>` or otherwise reads either property. When NS-007 enforced the connection timeout and concurrency limit, it sourced both values from the per-site `SmtpConfiguration` entity, not from `NotificationOptions` — so this options class is now entirely dead configuration. Its XML doc still claims it "provides fallback defaults and operational limits", which is misleading: nothing falls back to it. The double binding is also redundant. Dead, falsely-documented configuration invites an operator to set `ScadaLink:Notification:ConnectionTimeoutSeconds` and expect it to take effect, when it has no effect at all.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Either delete `NotificationOptions` and both of its registrations, or genuinely wire it in as the documented fallback (e.g. `DeliverAsync` uses `NotificationOptions` values when the `SmtpConfiguration` field is non-positive). If kept, remove the duplicate `Configure` call and correct the XML doc. The NS-007 resolution note already states the `NotificationOptions` fields "remain as operational fallback defaults" — that intent should be implemented or the class removed.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### NotificationService-018 — Concurrency limiter: lock-free read of a non-volatile field, never resized on redeployment, never disposed
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Low |
|
||||||
|
| Category | Concurrency & thread safety |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:237-255` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
Three minor issues with the NS-007 concurrency limiter. (1) `GetConcurrencyLimiter` reads the plain (non-`volatile`) field `_concurrencyLimiter` at `:242` *outside* `_limiterLock` as a fast path. The field is only written inside the lock (`:252`), so under the .NET memory model a concurrent thread is not guaranteed to observe the fully-constructed `SemaphoreSlim` via the lock-free read. In practice on x86/x64 this is benign and the double-checked write inside the lock is correct, but the unsynchronized read of a reference set under a lock is the classic lazy-init memory-model pitfall — use `Volatile.Read`/`LazyInitializer`/`Lazy<T>`. (2) The limiter is sized once from the *first* `SmtpConfiguration` ever seen and the `??=` at `:252` means it is never re-created. The design doc says one SMTP config is deployed per site, but that config can be **redeployed** with a different `MaxConcurrentConnections`; the new value is silently ignored for the lifetime of the (scoped, but potentially long-lived per request) service — and more importantly the limit is captured per `NotificationDeliveryService` instance, so behaviour depends on which scoped instance happens to create it first. (3) `SemaphoreSlim` implements `IDisposable`; `_concurrencyLimiter` is never disposed and `NotificationDeliveryService` does not implement `IDisposable`. A scoped service leaking one undisposed `SemaphoreSlim` per scope is a slow handle leak.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Replace the hand-rolled double-checked init with `Lazy<SemaphoreSlim>` or `LazyInitializer.EnsureInitialized`, or at minimum make the field access `Volatile.Read`/`Volatile.Write`. Consider hoisting the limiter to a singleton keyed by site (the limit is a per-site invariant, not a per-scope one) so redeployment with a changed limit and disposal are handled in one place; if it stays on the scoped service, implement `IDisposable` to dispose it.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|||||||
@@ -40,10 +40,10 @@ module file and counted in **Total**.
|
|||||||
| Severity | Open findings |
|
| Severity | Open findings |
|
||||||
|----------|---------------|
|
|----------|---------------|
|
||||||
| Critical | 0 |
|
| Critical | 0 |
|
||||||
| High | 5 |
|
| High | 8 |
|
||||||
| Medium | 12 |
|
| Medium | 20 |
|
||||||
| Low | 17 |
|
| Low | 27 |
|
||||||
| **Total** | **34** |
|
| **Total** | **55** |
|
||||||
|
|
||||||
## Module Status
|
## Module Status
|
||||||
|
|
||||||
@@ -59,11 +59,11 @@ module file and counted in **Total**.
|
|||||||
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 |
|
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 |
|
||||||
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 |
|
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 |
|
||||||
| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/3 | 4 | 16 |
|
| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/3 | 4 | 16 |
|
||||||
| [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
|
| [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/3 | 4 | 15 |
|
||||||
| [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
|
| [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/3/1 | 4 | 17 |
|
||||||
| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
|
| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/2 | 4 | 17 |
|
||||||
| [NotificationService](NotificationService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
|
| [NotificationService](NotificationService/findings.md) | 2026-05-16 | `9c60592` | 0/2/1/2 | 5 | 18 |
|
||||||
| [Security](Security/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
|
| [Security](Security/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/2 | 4 | 15 |
|
||||||
| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
|
| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
|
||||||
| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 |
|
| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 16 |
|
||||||
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
|
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
|
||||||
@@ -80,7 +80,7 @@ description, location, recommendation — lives in the module's `findings.md`.
|
|||||||
|
|
||||||
_None open._
|
_None open._
|
||||||
|
|
||||||
### High (5)
|
### High (8)
|
||||||
|
|
||||||
| ID | Module | Title |
|
| ID | Module | Title |
|
||||||
|----|--------|-------|
|
|----|--------|-------|
|
||||||
@@ -89,8 +89,11 @@ _None open._
|
|||||||
| DataConnectionLayer-014 | [DataConnectionLayer](DataConnectionLayer/findings.md) | DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger |
|
| DataConnectionLayer-014 | [DataConnectionLayer](DataConnectionLayer/findings.md) | DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger |
|
||||||
| DeploymentManager-015 | [DeploymentManager](DeploymentManager/findings.md) | Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates |
|
| DeploymentManager-015 | [DeploymentManager](DeploymentManager/findings.md) | Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates |
|
||||||
| ExternalSystemGateway-015 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `MaxRetries == 0` is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim |
|
| ExternalSystemGateway-015 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `MaxRetries == 0` is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim |
|
||||||
|
| ManagementService-014 | [ManagementService](ManagementService/findings.md) | HandleQueryDeployments bypasses site-scope enforcement |
|
||||||
|
| NotificationService-014 | [NotificationService](NotificationService/findings.md) | OAuth2 token-fetch failure escapes `DeliverBufferedAsync`; a permanently-broken config is retried forever |
|
||||||
|
| NotificationService-015 | [NotificationService](NotificationService/findings.md) | Unclassified exceptions (OAuth2 token fetch, non-cancellation OCE) escape `SendAsync` to the calling script |
|
||||||
|
|
||||||
### Medium (12)
|
### Medium (20)
|
||||||
|
|
||||||
| ID | Module | Title |
|
| ID | Module | Title |
|
||||||
|----|--------|-------|
|
|----|--------|-------|
|
||||||
@@ -106,8 +109,16 @@ _None open._
|
|||||||
| DeploymentManager-016 | [DeploymentManager](DeploymentManager/findings.md) | Reconciled prior record keeps its stale `RevisionHash` |
|
| DeploymentManager-016 | [DeploymentManager](DeploymentManager/findings.md) | Reconciled prior record keeps its stale `RevisionHash` |
|
||||||
| ExternalSystemGateway-016 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `ConfigureHttpClientDefaults` applies the ESG connection cap to every `HttpClient` in the host process |
|
| ExternalSystemGateway-016 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `ConfigureHttpClientDefaults` applies the ESG connection cap to every `HttpClient` in the host process |
|
||||||
| HealthMonitoring-015 | [HealthMonitoring](HealthMonitoring/findings.md) | Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt` |
|
| HealthMonitoring-015 | [HealthMonitoring](HealthMonitoring/findings.md) | Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt` |
|
||||||
|
| Host-012 | [Host](Host/findings.md) | `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read |
|
||||||
|
| InboundAPI-014 | [InboundAPI](InboundAPI/findings.md) | `ReturnDefinition` is loaded but never used; script return value is unshaped/unvalidated |
|
||||||
|
| InboundAPI-015 | [InboundAPI](InboundAPI/findings.md) | `ForbiddenApiChecker` is purely textual and is bypassable via reflection reachable without a forbidden namespace token |
|
||||||
|
| InboundAPI-016 | [InboundAPI](InboundAPI/findings.md) | Routed `Route.To().Call()` invocations are not bound by the method timeout |
|
||||||
|
| ManagementService-015 | [ManagementService](ManagementService/findings.md) | HandleSetInstanceOverrides applies overrides non-atomically |
|
||||||
|
| NotificationService-016 | [NotificationService](NotificationService/findings.md) | `AuthenticateAsync` silently sends unauthenticated for an unknown auth type or empty credentials |
|
||||||
|
| Security-012 | [Security](Security/findings.md) | Partial LDAP failure during login yields a roleless authenticated session |
|
||||||
|
| Security-014 | [Security](Security/findings.md) | `RefreshToken` re-issues a token without checking the idle timeout |
|
||||||
|
|
||||||
### Low (17)
|
### Low (27)
|
||||||
|
|
||||||
| ID | Module | Title |
|
| ID | Module | Title |
|
||||||
|----|--------|-------|
|
|----|--------|-------|
|
||||||
@@ -128,3 +139,13 @@ _None open._
|
|||||||
| HealthMonitoring-013 | [HealthMonitoring](HealthMonitoring/findings.md) | Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout` |
|
| HealthMonitoring-013 | [HealthMonitoring](HealthMonitoring/findings.md) | Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout` |
|
||||||
| HealthMonitoring-014 | [HealthMonitoring](HealthMonitoring/findings.md) | `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service |
|
| HealthMonitoring-014 | [HealthMonitoring](HealthMonitoring/findings.md) | `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service |
|
||||||
| HealthMonitoring-016 | [HealthMonitoring](HealthMonitoring/findings.md) | `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider` |
|
| HealthMonitoring-016 | [HealthMonitoring](HealthMonitoring/findings.md) | `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider` |
|
||||||
|
| Host-013 | [Host](Host/findings.md) | `:F0` rounding of cluster timing values silently degrades sub-second configuration |
|
||||||
|
| Host-014 | [Host](Host/findings.md) | Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8) |
|
||||||
|
| Host-015 | [Host](Host/findings.md) | `StartupRetry` retries on every exception type, including permanent failures |
|
||||||
|
| InboundAPI-017 | [InboundAPI](InboundAPI/findings.md) | `RouteHelper` / `RouteTarget` has no test coverage |
|
||||||
|
| ManagementService-016 | [ManagementService](ManagementService/findings.md) | Unexpected exception messages returned verbatim to HTTP callers |
|
||||||
|
| ManagementService-017 | [ManagementService](ManagementService/findings.md) | QueryDeploymentsCommand has no test coverage |
|
||||||
|
| NotificationService-017 | [NotificationService](NotificationService/findings.md) | `NotificationOptions` is bound from configuration but never read (dead config) |
|
||||||
|
| NotificationService-018 | [NotificationService](NotificationService/findings.md) | Concurrency limiter: lock-free read of a non-volatile field, never resized on redeployment, never disposed |
|
||||||
|
| Security-013 | [Security](Security/findings.md) | `ExtractFirstRdnValue` mis-parses group DNs containing escaped commas |
|
||||||
|
| Security-015 | [Security](Security/findings.md) | Username is not trimmed before use in the LDAP filter, fallback DN, and JWT claims |
|
||||||
|
|||||||
@@ -5,10 +5,10 @@
|
|||||||
| Module | `src/ScadaLink.Security` |
|
| Module | `src/ScadaLink.Security` |
|
||||||
| Design doc | `docs/requirements/Component-Security.md` |
|
| Design doc | `docs/requirements/Component-Security.md` |
|
||||||
| Status | Reviewed |
|
| Status | Reviewed |
|
||||||
| Last reviewed | 2026-05-16 |
|
| Last reviewed | 2026-05-17 |
|
||||||
| Reviewer | claude-agent |
|
| Reviewer | claude-agent |
|
||||||
| Commit reviewed | `9c60592` |
|
| Commit reviewed | `39d737e` |
|
||||||
| Open findings | 0 (1 deferred — Security-008) |
|
| Open findings | 4 (1 deferred — Security-008) |
|
||||||
|
|
||||||
## Summary
|
## Summary
|
||||||
|
|
||||||
@@ -28,6 +28,26 @@ on every refresh, weakening the documented 30-minute idle policy. None of these
|
|||||||
crash/data-loss bugs, but the TLS, cookie, and key-validation items are security
|
crash/data-loss bugs, but the TLS, cookie, and key-validation items are security
|
||||||
defects that should be fixed before any production deployment.
|
defects that should be fixed before any production deployment.
|
||||||
|
|
||||||
|
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||||
|
|
||||||
|
Re-reviewed the module after the batch-resolution of Security-001..007, 009, 010, 011.
|
||||||
|
Those fixes were verified in place and remain Resolved: the `LdapTransport` enum makes
|
||||||
|
StartTLS reachable, the auth cookie is `SecurePolicy.Always` with a sliding window, the
|
||||||
|
JWT signing key is length-validated at construction, issuer/audience are bound and
|
||||||
|
checked, the DN-injection fallback is RFC 4514-escaped, and the idle-timeout anchor is
|
||||||
|
carried across refreshes. Security-008 (N+1 scope-rule query) remains correctly
|
||||||
|
**Deferred** — `ISecurityRepository` still exposes no per-set scope-rule query, so the
|
||||||
|
N+1 cannot be removed from within `RoleMapper.cs` alone; the deferral condition is
|
||||||
|
unchanged. This pass surfaced **4 new findings** (Security-012..015): two Medium and
|
||||||
|
two Low. The most notable is that a *partial* LDAP outage during login (bind succeeds,
|
||||||
|
group search fails) silently returns an authenticated session with **zero roles**
|
||||||
|
instead of failing the login as the design's LDAP-failure rule requires
|
||||||
|
(Security-012); and that `RefreshToken` re-issues a token without ever checking
|
||||||
|
`IsIdleTimedOut`, so an idle-expired session can be renewed indefinitely if the caller
|
||||||
|
omits the separate idle check (Security-014). The two Low findings concern fragile
|
||||||
|
DN parsing of group names containing escaped commas and an un-trimmed username flowing
|
||||||
|
into the LDAP filter, fallback DN, and JWT claims.
|
||||||
|
|
||||||
## Checklist coverage
|
## Checklist coverage
|
||||||
|
|
||||||
| # | Category | Examined | Notes |
|
| # | Category | Examined | Notes |
|
||||||
@@ -456,3 +476,141 @@ suite. Remaining gaps closed here: renamed the test file `UnitTest1.cs` →
|
|||||||
LDAP-timeout coverage (Security-009) plus extra no-service-account / DN-path edge cases
|
LDAP-timeout coverage (Security-009) plus extra no-service-account / DN-path edge cases
|
||||||
(`BuildFallbackUserDn_NoSearchBase_ReturnsBareRdn`, `EscapeLdapDn_LeadingHash_IsEscaped`,
|
(`BuildFallbackUserDn_NoSearchBase_ReturnsBareRdn`, `EscapeLdapDn_LeadingHash_IsEscaped`,
|
||||||
`EscapeLdapDn_NullOrEmpty_ReturnedUnchanged`). Full module suite: 54 tests, all green.
|
`EscapeLdapDn_NullOrEmpty_ReturnedUnchanged`). Full module suite: 54 tests, all green.
|
||||||
|
|
||||||
|
### Security-012 — Partial LDAP failure during login yields a roleless authenticated session
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Medium |
|
||||||
|
| Category | Error handling & resilience |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.Security/LdapAuthService.cs:78-118` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
In `AuthenticateAsync`, after the user bind succeeds, the group/attribute `Search` is
|
||||||
|
wrapped in a `try`/`catch (LdapException)` that logs a warning and continues — the
|
||||||
|
comment states "Auth succeeded even if attribute lookup failed". The method then returns
|
||||||
|
`new LdapAuthResult(true, displayName, username, groups, null)` with an **empty
|
||||||
|
`groups`** list. Because `RoleMapper.MapGroupsToRolesAsync` derives all roles from that
|
||||||
|
group list, a login during a *partial* LDAP outage (bind reachable, search/group server
|
||||||
|
unavailable, or a transient `LdapException` mid-search) produces a fully authenticated
|
||||||
|
session with **zero roles** — the user is logged in but silently stripped of all
|
||||||
|
permissions. The design's LDAP Connection Failure rule states new logins must **fail**
|
||||||
|
when LDAP is unavailable; this path instead admits the user. The caller cannot
|
||||||
|
distinguish "user genuinely belongs to no mapped groups" from "the group query failed",
|
||||||
|
so it cannot make the documented fail-the-login decision. The `while` loop's inner
|
||||||
|
`catch (LdapException) { break; }` (line 107-111) compounds this: a real error mid-enumeration
|
||||||
|
is indistinguishable from end-of-results and silently truncates the group list.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Treat a failed group/attribute lookup on the *initial login* as an authentication
|
||||||
|
failure: return `Success = false` with a "directory unavailable, try again" message, or
|
||||||
|
add an explicit flag on `LdapAuthResult` (e.g. `GroupLookupSucceeded`) so the caller can
|
||||||
|
apply the design's fail-the-login rule. The inner `while`-loop `catch` should not treat
|
||||||
|
an arbitrary `LdapException` as a benign end-of-results sentinel — only the specific
|
||||||
|
"no more results" condition should break the loop; any other exception should propagate
|
||||||
|
or be surfaced.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### Security-013 — `ExtractFirstRdnValue` mis-parses group DNs containing escaped commas
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Low |
|
||||||
|
| Category | Correctness & logic bugs |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.Security/LdapAuthService.cs:258-269` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`ExtractFirstRdnValue` extracts a group name from a `memberOf` DN by taking the substring
|
||||||
|
between the first `=` and the first `,`. RFC 4514 permits a `,` inside an RDN value when
|
||||||
|
it is backslash-escaped — e.g. `cn=Acme\, Inc Operators,ou=groups,dc=example,dc=com`. The
|
||||||
|
method splits on the *escaped* comma and returns `Acme\` as the group name. `RoleMapper`
|
||||||
|
then matches that mangled name verbatim against configured `LdapGroupMapping.LdapGroupName`
|
||||||
|
values, so a group whose CN legitimately contains a comma silently fails to map to its
|
||||||
|
role — the user loses that role with no error. The same naive parsing also does not strip
|
||||||
|
the trailing backslash escape.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Parse the first RDN with an escape-aware scan: ignore a `,` preceded by an unescaped
|
||||||
|
`\`, and unescape RFC 4514 sequences in the extracted value. Alternatively use the LDAP
|
||||||
|
library's DN-parsing API (`LdapDN` / RDN accessors) rather than hand-rolled `IndexOf`.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### Security-014 — `RefreshToken` re-issues a token without checking the idle timeout
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Medium |
|
||||||
|
| Category | Security |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.Security/JwtTokenService.cs:156-169` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`RefreshToken` carries the existing `LastActivity` claim forward (correct, per the
|
||||||
|
Security-007 fix) and unconditionally issues a fresh token with a new 15-minute expiry.
|
||||||
|
It never calls `IsIdleTimedOut`. The documented 30-minute idle policy therefore depends
|
||||||
|
entirely on the CentralUI request pipeline calling `IsIdleTimedOut` *before* calling
|
||||||
|
`RefreshToken` — two independent, order-dependent checks with no enforcement that they
|
||||||
|
are used together. If a caller refreshes without first checking idle state (an easy
|
||||||
|
mistake, and there is no compiler or API signal preventing it), an idle-expired session
|
||||||
|
is silently renewed: each refresh resets the JWT expiry while `LastActivity` keeps
|
||||||
|
ageing, so the session can be kept alive indefinitely by background refreshes even
|
||||||
|
though the user has been idle for hours. `RefreshToken` is the natural place to enforce
|
||||||
|
the invariant but provides no safety net.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Have `RefreshToken` itself reject a principal that is already past the idle timeout —
|
||||||
|
return `null` (the same "cannot refresh" signal it already uses for missing claims) when
|
||||||
|
`IsIdleTimedOut(currentPrincipal)` is true — so the idle policy holds regardless of
|
||||||
|
caller discipline. Document that an idle-expired principal cannot be refreshed and add a
|
||||||
|
regression test covering refresh of a 31-minute-idle token.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### Security-015 — Username is not trimmed before use in the LDAP filter, fallback DN, and JWT claims
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Low |
|
||||||
|
| Category | Correctness & logic bugs |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.Security/LdapAuthService.cs:20-21`, `:80`, `:122`, `:169`, `:193` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`AuthenticateAsync` rejects null/whitespace-only usernames via `IsNullOrWhiteSpace`, but
|
||||||
|
an otherwise-valid username with leading or trailing whitespace (`" alice"`, copy-paste
|
||||||
|
artefacts, mobile keyboards) is passed through verbatim. It flows into the LDAP search
|
||||||
|
filter `({attr}={EscapeLdapFilter(username)})`, into the fallback bind DN via
|
||||||
|
`BuildFallbackUserDn`, and — most consequentially — into the returned `LdapAuthResult`
|
||||||
|
and from there into the JWT `Username` claim. Most LDAP directories ignore surrounding
|
||||||
|
whitespace on a search term, so `" alice"` and `"alice"` may both authenticate but yield
|
||||||
|
JWT `Username` claims that differ by whitespace. Any downstream identity comparison
|
||||||
|
(audit-log `who`, site-scope lookups keyed by username, session de-duplication) then
|
||||||
|
sees two distinct identities for the same person.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Trim the username once at the top of `AuthenticateAsync` (after the `IsNullOrWhiteSpace`
|
||||||
|
guard) and use the trimmed value consistently for the filter, the DN, and the
|
||||||
|
`LdapAuthResult`. This normalises the identity that ends up in the JWT claim and audit
|
||||||
|
trail.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|||||||
Reference in New Issue
Block a user