docs(code-reviews): re-review batch 3 at 39d737e — Host, InboundAPI, ManagementService, NotificationService, Security

21 new findings: Host-012..015, InboundAPI-014..017, ManagementService-014..017, NotificationService-014..018, Security-012..015.
This commit is contained in:
Joseph Doherty
2026-05-17 00:48:25 -04:00
parent 89636e2bbf
commit 3b3760f026
6 changed files with 873 additions and 41 deletions

View File

@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.Host` |
| Design doc | `docs/requirements/Component-Host.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 4 |
## Summary
@@ -31,20 +31,37 @@ and unguarded string interpolation when building HOCON. None are crash/data-loss
class, but the readiness bug is High because it breaks load-balancer behaviour with
no safe workaround.
#### Re-review 2026-05-17 (commit `39d737e`)
All eleven findings from the first review (Host-001..011) are confirmed `Resolved`
in the current tree — the `/health/ready` predicate, the externalised secrets, the
seed-node/GrpcPort validation rules, the escaped HOCON builder, the bounded
migration retry and the live `LoggingOptions.MinimumLevel` are all present as
described in their Resolution notes. This re-review walked all ten checklist
categories again over the full module and recorded four new findings, none of them
crash/data-loss class. Following up on a batch-1 ClusterInfrastructure re-review
note, Host-012 (Medium) confirms `AkkaHostedService.BuildHocon` hard-codes
`down-if-alone = on` and never reads the `ClusterOptions.DownIfAlone` property, so
that documented, bound option is dead. The remaining three are Low: `:F0` rounding
of cluster timing values silently degrades any sub-second configuration (Host-013),
Serilog sink setup is hard-coded in `Program.cs` rather than configuration-driven as
REQ-HOST-8 requires (Host-014), and `StartupRetry` retries indiscriminately on every
exception type including permanent schema-validation failures (Host-015).
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `/health/ready` includes the leader-only check (Host-001); site seed-node config points at the gRPC port (Host-004). |
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. HOCON built by raw string interpolation (Host-006); `StartAsync` returns before actors are confirmed running (Host-009). |
| 3 | Concurrency & thread safety | ☑ | Blocking `GetAwaiter().GetResult()` on a hosted-service startup thread (Host-005). `DeadLetterMonitorActor` state is actor-confined — no issues. |
| 4 | Error handling & resilience | ☑ | Top-level try/catch logs fatal and rethrows. No retry around DB migration / readiness preconditions (Host-010). |
| 5 | Security | ☑ | Plaintext DB password, LDAP service-account password and dev JWT key checked into `appsettings.Central.json` (Host-003). |
| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. |
| 7 | Design-document adherence | ☑ | REQ-HOST-6 mandates Akka.Persistence config but none exists and no persistent actors exist — doc is stale (Host-002). REQ-HOST-4 GrpcPort-≠-RemotingPort rule not enforced (Host-007). |
| 8 | Code organization & conventions | ☑ | `MachineDataDb` validated/declared but never consumed (Host-008). `LoggingOptions.MinimumLevel` is dead (Host-011). |
| 9 | Testing coverage | ☑ | Strong suite; no test asserts `/health/ready` excludes `active-node`, which is why Host-001 slipped through (noted in Host-001). |
| 10 | Documentation & comments | ☑ | Comments are accurate. REQ-HOST-6 in the design doc is the main stale-doc item (Host-002). |
| 1 | Correctness & logic bugs | ☑ | Host-001/004 resolved. Re-review: `:F0` rounding of cluster timing values silently distorts sub-second durations (Host-013). |
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. Host-006/009 resolved; no new issues. |
| 3 | Concurrency & thread safety | ☑ | Host-005 resolved (`StartAsync` now genuinely async). `DeadLetterMonitorActor` state is actor-confined — no issues. |
| 4 | Error handling & resilience | ☑ | Host-010 resolved. Re-review: `StartupRetry` retries indiscriminately on permanent failures (e.g. schema-validation mismatch) (Host-015). |
| 5 | Security | ☑ | Host-003 resolved — secrets externalised to env vars. No new issues. |
| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. No new issues. |
| 7 | Design-document adherence | ☑ | Host-002/007 resolved. Re-review: `ClusterOptions.DownIfAlone` bound/documented but never consumed — HOCON hard-codes `on` (Host-012); Serilog sinks hard-coded, not config-driven per REQ-HOST-8 (Host-014). |
| 8 | Code organization & conventions | ☑ | Host-008/011 resolved. No new issues. |
| 9 | Testing coverage | ☑ | Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new `down-if-alone`, sub-second-duration, or non-transient-retry paths (Host-012/013/015). |
| 10 | Documentation & comments | ☑ | REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014). |
## Findings
@@ -561,3 +578,153 @@ Regression tests in new `LoggerConfigurationTests`:
— they assert the configured level actually filters log events; they pass against the
new factory (and would fail against the old inline configuration, which ignored the
key). Full Host suite green (175 passed).
### Host-012 — `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:146-148` |
**Description**
`AkkaHostedService.BuildHocon` emits the split-brain-resolver block with
`keep-oldest { down-if-alone = on }` as a literal constant. `ClusterOptions`
(`src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:74`) exposes a
`DownIfAlone` property — bound from `ScadaLink:Cluster` via the Options pattern,
documented as "the design-doc requirement", default `true` — but a repo-wide search
shows it is referenced **nowhere outside its own declaration**. The Host therefore
ignores the bound value entirely: setting `ScadaLink:Cluster:DownIfAlone` to `false`
in `appsettings.json` has no effect, the resolver still runs with `down-if-alone =
on`. This is the same class of defect as the resolved Host-011
(`LoggingOptions.MinimumLevel` was dead config) — a configuration option that is
declared, bound, and documented but never consumed, which silently misleads any
operator who edits it. A batch-1 re-review of ClusterInfrastructure flagged the same
hard-coding; it is recorded here because `BuildHocon` is Host-owned code (the
ClusterInfrastructure project owns only the configuration contract, per the
`ClusterOptions` class comment).
**Recommendation**
Make `BuildHocon` consume `clusterOptions.DownIfAlone` — emit `down-if-alone =
{(clusterOptions.DownIfAlone ? "on" : "off")}` (the value is a bool, so no escaping
is needed). Add a `HoconBuilderTests` case asserting both `true` and `false` produce
the corresponding `down-if-alone` token. If the flag is genuinely meant to be fixed
at `on` for ScadaLink's two-node clusters, remove the `DownIfAlone` property and its
doc comment instead so code and configuration contract agree — but do not leave it
declared-but-dead.
**Resolution**
_Unresolved._
### Host-013 — `:F0` rounding of cluster timing values silently degrades sub-second configuration
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:135-136,145,151-152` |
**Description**
`BuildHocon` renders every duration into HOCON via `TotalSeconds:F0`
`transportHeartbeatSec:F0`, `transportFailureSec:F0`, `StableAfter.TotalSeconds:F0`,
`HeartbeatInterval.TotalSeconds:F0`, `FailureDetectionThreshold.TotalSeconds:F0`. The
`F0` format specifier rounds to whole seconds, so any sub-second configuration is
silently distorted: a `HeartbeatInterval` of `00:00:00.5` renders as
`heartbeat-interval = 0s` (a degenerate / invalid value that Akka will reject or
treat as zero), and `00:00:02.6` becomes `3s`. The shipped defaults are all whole
seconds so this is latent, but the configuration model accepts arbitrary `TimeSpan`
values and an operator tuning a heartbeat to e.g. `750ms` would get a `1s` value with
no warning — or, worse, `0s`. Rounding configured timing values without surfacing
the change is a correctness hazard for exactly the kind of failure-detection tuning
these options exist for.
**Recommendation**
Render durations without precision loss — emit milliseconds (e.g.
`heartbeat-interval = {ts.TotalMilliseconds:F0}ms`) so sub-second values survive, or
validate in `StartupValidator` that each cluster timing value is a positive whole
number of seconds and fail fast otherwise. Either way, do not silently round.
**Resolution**
_Unresolved._
### Host-014 — Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8)
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Host/Program.cs:43-48`, `src/ScadaLink.Host/appsettings.json:1-7` |
**Description**
REQ-HOST-8 requires the Host to configure Serilog with "Configuration-driven sink
setup (console and file sinks at minimum)". `LoggerConfigurationFactory.Build` calls
`.ReadFrom.Configuration(configuration)`, which reads the standard `Serilog`
configuration section — but neither `appsettings.json` nor either role-specific file
contains a `Serilog` section (only a Microsoft `Logging` section with a `LogLevel`
map, which Serilog's `ReadFrom.Configuration` does not consume). The two sinks that
actually run are appended in `Program.cs` as hard-coded `.WriteTo.Console(...)` and
`.WriteTo.File("logs/scadalink-.log", rollingInterval: Day)` calls. As a result the
console output template, the file path, and the rolling interval cannot be changed
without recompiling — an operator cannot redirect logs, change the file location, or
add a sink via configuration, contrary to REQ-HOST-8. The `ReadFrom.Configuration`
call is effectively a no-op because the section it reads does not exist.
**Recommendation**
Move the console and file sink definitions into a `Serilog` section in
`appsettings.json` (with `WriteTo` entries and the output template / file path /
rolling interval as arguments) so `ReadFrom.Configuration` drives them, and drop the
hard-coded `.WriteTo` calls from `Program.cs`. Alternatively, update REQ-HOST-8 to
state the sinks are intentionally code-defined — but the design doc currently says
"configuration-driven", so code and doc must be reconciled.
**Resolution**
_Unresolved._
### Host-015 — `StartupRetry` retries on every exception type, including permanent failures
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Host/StartupRetry.cs:36-45` |
**Description**
`StartupRetry.ExecuteWithRetryAsync` catches `Exception` with only the guard
`when (attempt < maxAttempts)` — it retries on *any* exception type. Its sole caller
wraps `MigrationHelper.ApplyOrValidateMigrationsAsync` (`Program.cs:124-134`), which
on a Central node in production *validates* the schema version and throws if the
deployed schema does not match the expected migration set. A schema-version mismatch
is a permanent error — retrying it cannot succeed — yet `StartupRetry` will retry it
8 times with capped exponential backoff (2s, 4s, 8s, 16s, 30s, 30s, 30s ≈ 2 minutes)
before finally rethrowing, delaying the fatal-exit-and-alert by minutes for a fault
that is fatal on the first attempt. The retry helper is meant for *transient*
connection faults (the XML doc says exactly that: "the database may be briefly
unreachable"), but it cannot distinguish transient from permanent failures.
**Recommendation**
Restrict retries to transient faults — e.g. accept an `Func<Exception, bool>
isTransient` predicate and, for the migration call site, treat only
connection-class exceptions (`SqlException` with a connection/transport error
number, `TimeoutException`, socket errors) as retryable while letting
schema-validation / `InvalidOperationException` failures propagate immediately. Add
a `StartupRetryTests` case asserting a non-transient exception is rethrown after a
single attempt.
**Resolution**
_Unresolved._