fix(host): resolve Host-005..011 — async startup, HOCON escaping, port-conflict check, dead-config cleanup, migration retry, log-level wiring; Host-002 flagged
This commit is contained in:
@@ -8,7 +8,7 @@
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 8 |
|
||||
| Open findings | 1 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -250,7 +250,7 @@ appropriately before the fix and pass after.
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:345` |
|
||||
|
||||
**Description**
|
||||
@@ -270,7 +270,20 @@ storeAndForwardService.StartAsync(cancellationToken)`. Propagate the
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: `RegisterSiteActors`
|
||||
called `storeAndForwardService.StartAsync().GetAwaiter().GetResult()` synchronously
|
||||
inside the `IHostedService.StartAsync` path, blocking a startup thread and wrapping
|
||||
any failure in `AggregateException`. Fix: `AkkaHostedService.StartAsync` is now
|
||||
genuinely `async Task`; the site path was renamed `RegisterSiteActorsAsync` and made
|
||||
`async`; the store-and-forward init is now `await storeAndForwardService.StartAsync()`
|
||||
(the service's own signature, in another module, takes no `CancellationToken`, so the
|
||||
token is honoured via a `cancellationToken.ThrowIfCancellationRequested()` checkpoint
|
||||
immediately before the await). No blocking, no sync-context deadlock risk, and the
|
||||
original exception type now surfaces. No dedicated regression test: the only
|
||||
observable change is non-blocking/exception-unwrapping behaviour on a path that
|
||||
requires the full site DI graph plus a real SQLite store; the existing
|
||||
`HostStartupTests.SiteRole_*` and `CompositionRootTests` continue to exercise the
|
||||
async startup pipeline and remain green (175 passed).
|
||||
|
||||
### Host-006 — HOCON assembled by unescaped string interpolation
|
||||
|
||||
@@ -278,7 +291,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Akka.NET conventions |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:70-108` |
|
||||
|
||||
**Description**
|
||||
@@ -302,7 +315,25 @@ and seed nodes) before substitution.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: the Akka HOCON was
|
||||
built by an interpolated string injecting `NodeHostname`, `SeedNodes`, roles and
|
||||
`SplitBrainResolverStrategy` directly into the document text with no escaping — a
|
||||
value containing a quote, backslash or whitespace would corrupt or misparse the
|
||||
document. Re-triaged the recommendation: switching wholesale to the typed
|
||||
`Akka.Hosting` builder is a larger refactor than this Low finding warrants and would
|
||||
touch how the whole cluster is bootstrapped, so the HOCON form was retained but made
|
||||
safe. Fix: the HOCON construction was extracted into a testable static
|
||||
`AkkaHostedService.BuildHocon`, and every interpolated *string* value is now routed
|
||||
through a `QuoteHocon` helper that wraps the value in a double-quoted HOCON literal
|
||||
and escapes embedded backslashes and quotes — `NodeHostname`, every seed node, every
|
||||
role, and `SplitBrainResolverStrategy` (the last was previously unquoted entirely, so
|
||||
whitespace alone would break it). Regression tests in new `HoconBuilderTests`:
|
||||
`BuildHocon_HostnameWithQuote_DoesNotCorruptDocument` (a hostname with a `"` still
|
||||
parses and following keys stay intact), `BuildHocon_StrategyWithWhitespace_RemainsQuoted`,
|
||||
and `BuildHocon_PlainValues_ParsesAndPreservesValues`. With the old raw interpolation
|
||||
a quote-containing hostname produced `hostname = "evil"host"`, closing the literal
|
||||
early and corrupting following keys; the escaped form parses correctly. Full Host
|
||||
suite green (175 passed).
|
||||
|
||||
### Host-007 — REQ-HOST-4 rule "GrpcPort ≠ RemotingPort" is not enforced
|
||||
|
||||
@@ -310,7 +341,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Design-document adherence |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.Host/StartupValidator.cs:43-47` |
|
||||
|
||||
**Description**
|
||||
@@ -332,7 +363,18 @@ Add a check in the `role == "Site"` block: if `GrpcPort` (resolved, including th
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed against
|
||||
`StartupValidator.cs`: the `role == "Site"` block validated the GrpcPort range but
|
||||
never compared it to `RemotingPort`, so a site config setting both ports equal
|
||||
passed validation and then failed opaquely when Kestrel and Akka.Remote both bound
|
||||
the port. Fix: added `if (port == grpcPort) errors.Add("ScadaLink:Node:GrpcPort must
|
||||
differ from RemotingPort")` in the Site block, using the resolved GrpcPort (including
|
||||
the 8083 `NodeOptions` default when the key is absent) against the parsed
|
||||
RemotingPort. Regression tests in `StartupValidatorTests`:
|
||||
`Site_GrpcPortEqualsRemotingPort_FailsValidation`,
|
||||
`Site_DefaultGrpcPortEqualsRemotingPort_FailsValidation` (default-8083 path), and
|
||||
`Site_GrpcPortDiffersFromRemotingPort_PassesValidation` — all failed appropriately
|
||||
before the fix and pass after. Full Host suite green (175 passed).
|
||||
|
||||
### Host-008 — `MachineDataDb` is validated and declared but never consumed
|
||||
|
||||
@@ -340,7 +382,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Code organization & conventions |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.Host/StartupValidator.cs:33-34`, `src/ScadaLink.Host/DatabaseOptions.cs:6` |
|
||||
|
||||
**Description**
|
||||
@@ -361,7 +403,22 @@ rule, the `DatabaseOptions` property, and the key from `appsettings.Central.json
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: a repo-wide search
|
||||
shows `MachineDataDb` is read nowhere outside the Host module — only `ConfigurationDb`
|
||||
is wired into `AddConfigurationDatabase` — yet `StartupValidator` failed Central
|
||||
startup when it was absent. No machine-data store exists or is registered, so this is
|
||||
dead configuration. Fix (the "if no" branch of the recommendation): removed the
|
||||
`MachineDataDb` validation rule from `StartupValidator`, the `MachineDataDb` property
|
||||
from `DatabaseOptions`, the `MachineDataDb` key from `appsettings.Central.json`, and
|
||||
the now-pointless `MachineDataDb` reference from that file's `_secrets` note and from
|
||||
the `CentralDbTestEnvironment` test helper. Regression test: the prior
|
||||
`StartupValidatorTests.Central_MissingMachineDataDb_FailsValidation` was inverted to
|
||||
`Central_MissingMachineDataDb_PassesValidation` (absence must now NOT fail startup),
|
||||
and `MachineDataDb` was removed from the shared `ValidCentralConfig` fixture so all
|
||||
Central tests prove the key is no longer required; it failed under the old code and
|
||||
passes after. Full Host suite green (175 passed). Note: `appsettings.Central.json`
|
||||
under `docker/central-node-*` is outside this task's edit scope and still carries the
|
||||
key harmlessly — it will simply be ignored; a follow-up may remove it for tidiness.
|
||||
|
||||
### Host-009 — `StartAsync` reports success before role actors are confirmed running
|
||||
|
||||
@@ -369,7 +426,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Akka.NET conventions |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:127-141` |
|
||||
|
||||
**Description**
|
||||
@@ -395,7 +452,24 @@ actor system exists, not that the cluster handshake has completed.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16 (commit `pending`) — re-triaged, documentation fix. Root cause
|
||||
confirmed: `SetReady` is called optimistically after `ActorOf` registrations whose
|
||||
`PreStart`/initial-contact handshakes run asynchronously. Re-triaged the first
|
||||
recommended option (gating `SetReady` on a `SiteCommunicationActor` `Ask`
|
||||
round-trip): that option is **incorrect** because the meaningful asynchronous step is
|
||||
the `ClusterClient` initial-contact handshake with central — gating site gRPC
|
||||
readiness on it would prevent a site from coming up and streaming locally while
|
||||
central is briefly unreachable, contradicting the design's site-autonomy model.
|
||||
Confirmed `SiteStreamGrpcServer` already rejects any stream opened before `SetReady`
|
||||
with `StatusCode.Unavailable`, so the window is benign. Fix (the second, correct
|
||||
option): the `SetReady` call site now carries an explicit comment documenting the
|
||||
narrow readiness contract — by `SetReady` the actor system exists,
|
||||
`SiteStreamManager.Initialize` has run, and every role actor has been created with
|
||||
ordered synchronous `ActorOf` + registration `Tell`s; what is deliberately NOT
|
||||
guaranteed is `PreStart` completion or the central handshake. No regression test is
|
||||
meaningful for a documentation-only clarification of an intentional design contract;
|
||||
the existing `HostStartupTests.SiteRole_*` cover the surrounding startup path. Full
|
||||
Host suite green (175 passed).
|
||||
|
||||
### Host-010 — No retry/backoff around startup preconditions (DB migration, readiness)
|
||||
|
||||
@@ -403,7 +477,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.Host/Program.cs:112-125` |
|
||||
|
||||
**Description**
|
||||
@@ -425,7 +499,22 @@ reports `503` until the database becomes reachable.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: Central startup called
|
||||
`MigrationHelper.ApplyOrValidateMigrationsAsync` directly with no tolerance for a
|
||||
database that is briefly unreachable at boot — a first connection failure threw, the
|
||||
top-level `catch` logged `Fatal`, and the process exited. Fix: added a new
|
||||
`StartupRetry.ExecuteWithRetryAsync` helper (bounded retry with capped exponential
|
||||
backoff — no Polly dependency, keeping the Host's package set minimal) and wrapped
|
||||
the migration step in it (`maxAttempts: 8`, `initialDelay: 2s`, doubling, capped at
|
||||
30s). Each retry opens a fresh DI scope so a transient connection fault does not
|
||||
leave a poisoned `DbContext`; once attempts are exhausted the last exception
|
||||
propagates and the existing fatal-log path still fires. Regression tests in new
|
||||
`StartupRetryTests`: `ExecuteWithRetry_SucceedsFirstTry_RunsOnce`,
|
||||
`ExecuteWithRetry_TransientFailures_RetriesUntilSuccess` (fails twice then succeeds),
|
||||
and `ExecuteWithRetry_ExhaustsAttempts_RethrowsLastException` (verifies the original
|
||||
exception type/message surfaces after exhaustion) — they fail to compile/pass against
|
||||
the pre-fix code (the helper did not exist) and pass after. Full Host suite green
|
||||
(175 passed).
|
||||
|
||||
### Host-011 — `LoggingOptions.MinimumLevel` is dead configuration
|
||||
|
||||
@@ -433,7 +522,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Code organization & conventions |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.Host/LoggingOptions.cs:5`, `src/ScadaLink.Host/Program.cs:42-50` |
|
||||
|
||||
**Description**
|
||||
@@ -456,4 +545,21 @@ configuration section. Keep one mechanism, not two.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16 (commit `pending`). Root cause confirmed: `LoggingOptions.MinimumLevel`
|
||||
was bound from `ScadaLink:Logging` but Serilog was configured purely via
|
||||
`ReadFrom.Configuration` (standard `Serilog` section), so editing
|
||||
`ScadaLink:Logging:MinimumLevel` had no effect. Re-triaged the two options: the
|
||||
"remove the option class" branch was rejected because `Component-Host.md` REQ-HOST-3
|
||||
explicitly lists `ScadaLink:Logging` → `LoggingOptions` and that design doc is outside
|
||||
this task's edit scope — removing the class would create a fresh code-vs-doc drift.
|
||||
Fix (the "consume it" branch): added a `LoggerConfigurationFactory.Build` that binds
|
||||
`LoggingOptions`, parses `MinimumLevel` into a Serilog `LogEventLevel` (falling back
|
||||
to `Information` for null/blank/unrecognised values), and pins it via
|
||||
`MinimumLevel.Is(...)` on top of `ReadFrom.Configuration`; `Program.cs` now builds the
|
||||
logger through this factory. `ScadaLink:Logging:MinimumLevel` is now live.
|
||||
Regression tests in new `LoggerConfigurationTests`:
|
||||
`MinimumLevel_Warning_SuppressesInformationLogs`,
|
||||
`MinimumLevel_Debug_AllowsDebugLogs`, and `MinimumLevel_Absent_DefaultsToInformation`
|
||||
— they assert the configured level actually filters log events; they pass against the
|
||||
new factory (and would fail against the old inline configuration, which ignored the
|
||||
key). Full Host suite green (175 passed).
|
||||
|
||||
Reference in New Issue
Block a user