docs(code-reviews): re-review batch 3 at 39d737e — Host, InboundAPI, ManagementService, NotificationService, Security
21 new findings: Host-012..015, InboundAPI-014..017, ManagementService-014..017, NotificationService-014..018, Security-012..015.
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.Host` |
| Design doc | `docs/requirements/Component-Host.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 4 |

## Summary

@@ -31,20 +31,37 @@ and unguarded string interpolation when building HOCON. None are crash/data-loss
class, but the readiness bug is High because it breaks load-balancer behaviour with
no safe workaround.

#### Re-review 2026-05-17 (commit `39d737e`)

All eleven findings from the first review (Host-001..011) are confirmed `Resolved`
in the current tree — the `/health/ready` predicate, the externalised secrets, the
seed-node/GrpcPort validation rules, the escaped HOCON builder, the bounded
migration retry and the live `LoggingOptions.MinimumLevel` are all present as
described in their Resolution notes. This re-review walked all ten checklist
categories again over the full module and recorded four new findings, none of them
crash/data-loss class. Following up on a batch-1 ClusterInfrastructure re-review
note, Host-012 (Medium) confirms `AkkaHostedService.BuildHocon` hard-codes
`down-if-alone = on` and never reads the `ClusterOptions.DownIfAlone` property, so
that documented, bound option is dead. The remaining three are Low: `:F0` rounding
of cluster timing values silently degrades any sub-second configuration (Host-013),
Serilog sink setup is hard-coded in `Program.cs` rather than configuration-driven as
REQ-HOST-8 requires (Host-014), and `StartupRetry` retries indiscriminately on every
exception type including permanent schema-validation failures (Host-015).

## Checklist coverage

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | `/health/ready` includes the leader-only check (Host-001); site seed-node config points at the gRPC port (Host-004). |
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. HOCON built by raw string interpolation (Host-006); `StartAsync` returns before actors are confirmed running (Host-009). |
| 3 | Concurrency & thread safety | ☑ | Blocking `GetAwaiter().GetResult()` on a hosted-service startup thread (Host-005). `DeadLetterMonitorActor` state is actor-confined — no issues. |
| 4 | Error handling & resilience | ☑ | Top-level try/catch logs fatal and rethrows. No retry around DB migration / readiness preconditions (Host-010). |
| 5 | Security | ☑ | Plaintext DB password, LDAP service-account password and dev JWT key checked into `appsettings.Central.json` (Host-003). |
| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. |
| 7 | Design-document adherence | ☑ | REQ-HOST-6 mandates Akka.Persistence config but none exists and no persistent actors exist — doc is stale (Host-002). REQ-HOST-4 GrpcPort-≠-RemotingPort rule not enforced (Host-007). |
| 8 | Code organization & conventions | ☑ | `MachineDataDb` validated/declared but never consumed (Host-008). `LoggingOptions.MinimumLevel` is dead (Host-011). |
| 9 | Testing coverage | ☑ | Strong suite; no test asserts `/health/ready` excludes `active-node`, which is why Host-001 slipped through (noted in Host-001). |
| 10 | Documentation & comments | ☑ | Comments are accurate. REQ-HOST-6 in the design doc is the main stale-doc item (Host-002). |
| 1 | Correctness & logic bugs | ☑ | Host-001/004 resolved. Re-review: `:F0` rounding of cluster timing values silently distorts sub-second durations (Host-013). |
| 2 | Akka.NET conventions | ☑ | CoordinatedShutdown, receptionist registration, singleton scoping all correct. Host-006/009 resolved; no new issues. |
| 3 | Concurrency & thread safety | ☑ | Host-005 resolved (`StartAsync` now genuinely async). `DeadLetterMonitorActor` state is actor-confined — no issues. |
| 4 | Error handling & resilience | ☑ | Host-010 resolved. Re-review: `StartupRetry` retries indiscriminately on permanent failures (e.g. schema-validation mismatch) (Host-015). |
| 5 | Security | ☑ | Host-003 resolved — secrets externalised to env vars. No new issues. |
| 6 | Performance & resource management | ☑ | No undisposed resources. Inbound API script compilation is a synchronous startup loop — acceptable. No new issues. |
| 7 | Design-document adherence | ☑ | Host-002/007 resolved. Re-review: `ClusterOptions.DownIfAlone` bound/documented but never consumed — HOCON hard-codes `on` (Host-012); Serilog sinks hard-coded, not config-driven per REQ-HOST-8 (Host-014). |
| 8 | Code organization & conventions | ☑ | Host-008/011 resolved. No new issues. |
| 9 | Testing coverage | ☑ | Strong suite; regression tests added for Host-001/004/006/007/010/011. No coverage for the new `down-if-alone`, sub-second-duration, or non-transient-retry paths (Host-012/013/015). |
| 10 | Documentation & comments | ☑ | REQ-HOST-6 stale-doc resolved. Re-review: REQ-HOST-8 says sinks are "configuration-driven" but they are code-defined (Host-014). |

## Findings

@@ -561,3 +578,153 @@ Regression tests in new `LoggerConfigurationTests`:
— they assert the configured level actually filters log events; they pass against the
new factory (and would fail against the old inline configuration, which ignored the
key). Full Host suite green (175 passed).

### Host-012 — `down-if-alone` hard-coded in HOCON; `ClusterOptions.DownIfAlone` is never read

| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:146-148` |

**Description**

`AkkaHostedService.BuildHocon` emits the split-brain-resolver block with
`keep-oldest { down-if-alone = on }` as a literal constant. `ClusterOptions`
(`src/ScadaLink.ClusterInfrastructure/ClusterOptions.cs:74`) exposes a
`DownIfAlone` property — bound from `ScadaLink:Cluster` via the Options pattern,
documented as "the design-doc requirement", default `true` — but a repo-wide search
shows it is referenced **nowhere outside its own declaration**. The Host therefore
ignores the bound value entirely: setting `ScadaLink:Cluster:DownIfAlone` to `false`
in `appsettings.json` has no effect, the resolver still runs with `down-if-alone =
on`. This is the same class of defect as the resolved Host-011
(`LoggingOptions.MinimumLevel` was dead config) — a configuration option that is
declared, bound, and documented but never consumed, which silently misleads any
operator who edits it. A batch-1 re-review of ClusterInfrastructure flagged the same
hard-coding; it is recorded here because `BuildHocon` is Host-owned code (the
ClusterInfrastructure project owns only the configuration contract, per the
`ClusterOptions` class comment).

**Recommendation**

Make `BuildHocon` consume `clusterOptions.DownIfAlone` — emit `down-if-alone =
{(clusterOptions.DownIfAlone ? "on" : "off")}` (the value is a bool, so no escaping
is needed). Add a `HoconBuilderTests` case asserting both `true` and `false` produce
the corresponding `down-if-alone` token. If the flag is genuinely meant to be fixed
at `on` for ScadaLink's two-node clusters, remove the `DownIfAlone` property and its
doc comment instead so code and configuration contract agree — but do not leave it
declared-but-dead.
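
The recommended change can be sketched in isolation. This is a minimal illustration, not the project's `BuildHocon`: `RenderKeepOldest` is a hypothetical helper that takes the flag directly, and the real method builds a much larger HOCON document around this fragment.

```csharp
using System;

// Hypothetical helper (not part of ScadaLink): renders only the keep-oldest
// fragment, taking the flag that ClusterOptions.DownIfAlone would supply.
static string RenderKeepOldest(bool downIfAlone)
{
    // HOCON accepts on/off as boolean literals, so no escaping is required.
    string flag = downIfAlone ? "on" : "off";
    // Doubled braces are literal braces inside an interpolated string.
    return $"keep-oldest {{ down-if-alone = {flag} }}";
}

Console.WriteLine(RenderKeepOldest(true));   // keep-oldest { down-if-alone = on }
Console.WriteLine(RenderKeepOldest(false));  // keep-oldest { down-if-alone = off }
```

A `HoconBuilderTests` case along the lines suggested above would assert both outputs, so that a future regression to a hard-coded literal fails the `false` case.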

**Resolution**

_Unresolved._

### Host-013 — `:F0` rounding of cluster timing values silently degrades sub-second configuration

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:135-136,145,151-152` |

**Description**

`BuildHocon` renders every duration into HOCON via `TotalSeconds:F0` —
`transportHeartbeatSec:F0`, `transportFailureSec:F0`, `StableAfter.TotalSeconds:F0`,
`HeartbeatInterval.TotalSeconds:F0`, `FailureDetectionThreshold.TotalSeconds:F0`. The
`F0` format specifier rounds to whole seconds, so any sub-second configuration is
silently distorted: a `HeartbeatInterval` of `00:00:00.5` renders as
`heartbeat-interval = 0s` (a degenerate / invalid value that Akka will reject or
treat as zero), and `00:00:02.6` becomes `3s`. The shipped defaults are all whole
seconds so this is latent, but the configuration model accepts arbitrary `TimeSpan`
values and an operator tuning a heartbeat to e.g. `750ms` would get a `1s` value with
no warning — or, worse, `0s`. Rounding configured timing values without surfacing
the change is a correctness hazard for exactly the kind of failure-detection tuning
these options exist for.

**Recommendation**

Render durations without precision loss — emit milliseconds (e.g.
`heartbeat-interval = {ts.TotalMilliseconds:F0}ms`) so sub-second values survive, or
validate in `StartupValidator` that each cluster timing value is a positive whole
number of seconds and fail fast otherwise. Either way, do not silently round.
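
The two options can be compared side by side in a small sketch. The helper names (`RenderSecondsF0`, `RenderMillis`, `IsPositiveWholeSeconds`) are hypothetical and stand in for the formatting inside `BuildHocon` and a possible `StartupValidator` rule. The sample values deliberately avoid exact midpoints such as 0.5 s, whose rounding is not verified here.

```csharp
using System;
using System.Globalization;

// Mirrors the current behaviour: whole seconds only, sub-second detail is lost.
static string RenderSecondsF0(TimeSpan ts) =>
    ts.TotalSeconds.ToString("F0", CultureInfo.InvariantCulture) + "s";

// Option 1: render milliseconds so sub-second values survive.
static string RenderMillis(TimeSpan ts) =>
    ts.TotalMilliseconds.ToString("F0", CultureInfo.InvariantCulture) + "ms";

// Option 2: a validation rule that rejects anything the F0 format would distort.
static bool IsPositiveWholeSeconds(TimeSpan ts) =>
    ts > TimeSpan.Zero && ts.Ticks % TimeSpan.TicksPerSecond == 0;

var heartbeat = TimeSpan.FromMilliseconds(750);
Console.WriteLine(RenderSecondsF0(heartbeat));        // 1s
Console.WriteLine(RenderMillis(heartbeat));           // 750ms
Console.WriteLine(IsPositiveWholeSeconds(heartbeat)); // False

var tiny = TimeSpan.FromMilliseconds(400);
Console.WriteLine(RenderSecondsF0(tiny));             // 0s
Console.WriteLine(RenderMillis(tiny));                // 400ms
```

Either option removes the silent distortion: the first keeps the operator's value, the second refuses to start with a value that cannot be represented.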

**Resolution**

_Unresolved._

### Host-014 — Serilog sinks are hard-coded in `Program.cs`, not configuration-driven (REQ-HOST-8)

| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.Host/Program.cs:43-48`, `src/ScadaLink.Host/appsettings.json:1-7` |

**Description**

REQ-HOST-8 requires the Host to configure Serilog with "Configuration-driven sink
setup (console and file sinks at minimum)". `LoggerConfigurationFactory.Build` calls
`.ReadFrom.Configuration(configuration)`, which reads the standard `Serilog`
configuration section — but neither `appsettings.json` nor either role-specific file
contains a `Serilog` section (only a Microsoft `Logging` section with a `LogLevel`
map, which Serilog's `ReadFrom.Configuration` does not consume). The two sinks that
actually run are appended in `Program.cs` as hard-coded `.WriteTo.Console(...)` and
`.WriteTo.File("logs/scadalink-.log", rollingInterval: Day)` calls. As a result the
console output template, the file path, and the rolling interval cannot be changed
without recompiling — an operator cannot redirect logs, change the file location, or
add a sink via configuration, contrary to REQ-HOST-8. The `ReadFrom.Configuration`
call is effectively a no-op because the section it reads does not exist.

**Recommendation**

Move the console and file sink definitions into a `Serilog` section in
`appsettings.json` (with `WriteTo` entries and the output template / file path /
rolling interval as arguments) so `ReadFrom.Configuration` drives them, and drop the
hard-coded `.WriteTo` calls from `Program.cs`. Alternatively, update REQ-HOST-8 to
state the sinks are intentionally code-defined — but the design doc currently says
"configuration-driven", so code and doc must be reconciled.
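
For illustration, a `Serilog` section of roughly this shape is what `ReadFrom.Configuration` consumes. This is a sketch under assumptions: the file path and rolling interval are taken from the hard-coded call quoted above, while the console `outputTemplate` is a placeholder because the template used in `Program.cs` is not shown in this review.

```json
{
  "Serilog": {
    "Using": [ "Serilog.Sinks.Console", "Serilog.Sinks.File" ],
    "WriteTo": [
      {
        "Name": "Console",
        "Args": {
          "outputTemplate": "[{Timestamp:HH:mm:ss} {Level:u3}] {Message:lj}{NewLine}{Exception}"
        }
      },
      {
        "Name": "File",
        "Args": {
          "path": "logs/scadalink-.log",
          "rollingInterval": "Day"
        }
      }
    ]
  }
}
```

With the sinks declared here, an operator could change the file location or add a sink by editing configuration only, which is what REQ-HOST-8 describes.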

**Resolution**

_Unresolved._

### Host-015 — `StartupRetry` retries on every exception type, including permanent failures

| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.Host/StartupRetry.cs:36-45` |

**Description**

`StartupRetry.ExecuteWithRetryAsync` catches `Exception` with only the guard
`when (attempt < maxAttempts)` — it retries on *any* exception type. Its sole caller
wraps `MigrationHelper.ApplyOrValidateMigrationsAsync` (`Program.cs:124-134`), which
on a Central node in production *validates* the schema version and throws if the
deployed schema does not match the expected migration set. A schema-version mismatch
is a permanent error — retrying it cannot succeed — yet `StartupRetry` will attempt it
8 times with capped exponential backoff between attempts (2s, 4s, 8s, 16s, 30s, 30s,
30s ≈ 2 minutes) before finally rethrowing, delaying the fatal-exit-and-alert by
minutes for a fault that is fatal on the first attempt. The retry helper is meant for
*transient* connection faults (the XML doc says exactly that: "the database may be
briefly unreachable"), but it cannot distinguish transient from permanent failures.

**Recommendation**

Restrict retries to transient faults — e.g. accept a `Func<Exception, bool>
isTransient` predicate and, for the migration call site, treat only
connection-class exceptions (`SqlException` with a connection/transport error
number, `TimeoutException`, socket errors) as retryable while letting
schema-validation / `InvalidOperationException` failures propagate immediately. Add
a `StartupRetryTests` case asserting a non-transient exception is rethrown after a
single attempt.
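
A predicate-based retry loop might look like the sketch below. It is not the project's `StartupRetry`: the signature, the `delayForAttempt` parameter and the `CappedBackoff` formula are assumptions, the latter chosen only because it reproduces the 2s, 4s, 8s, 16s, 30s schedule listed in the description. The transient check is reduced to `TimeoutException` to keep the example free of database dependencies.

```csharp
using System;
using System.Threading.Tasks;

// Assumed backoff formula: 2^attempt seconds, capped at 30 seconds.
static TimeSpan CappedBackoff(int attempt) =>
    TimeSpan.FromSeconds(Math.Min(30, Math.Pow(2, attempt)));

// Hypothetical retry helper: retries only when isTransient(ex) is true and
// attempts remain; every other exception propagates on the first failure.
// Returns the number of attempts that were needed.
static async Task<int> ExecuteWithRetryAsync(
    Func<Task> operation,
    Func<Exception, bool> isTransient,
    int maxAttempts,
    Func<int, TimeSpan> delayForAttempt)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            await operation();
            return attempt;
        }
        catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
        {
            await Task.Delay(delayForAttempt(attempt));
        }
    }
}

// A permanent failure (standing in for a schema-version mismatch) is not retried.
int calls = 0;
try
{
    await ExecuteWithRetryAsync(
        () => { calls++; return Task.FromException(new InvalidOperationException("schema version mismatch")); },
        ex => ex is TimeoutException,
        maxAttempts: 8,
        delayForAttempt: CappedBackoff);
}
catch (InvalidOperationException)
{
    Console.WriteLine($"permanent failure surfaced after {calls} attempt(s)");
}
```

Because the exception filter includes the predicate, a non-transient exception never enters the `catch` block and leaves the loop immediately, which is the behaviour the suggested `StartupRetryTests` case would assert.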

**Resolution**

_Unresolved._